Abstract: The dynamic power requirement of CMOS circuits is rapidly becoming a 
major concern in the design of personal information systems and large computers. In this 
work we present a number of new CMOS logic families, Charge Recovery Logic (CRL) as 
well as the much improved Split-Level Charge Recovery Logic (SCRL), within which the 
transfer of charge between the nodes occurs quasistatically. Operating quasistatically, these 
logic families have an energy dissipation that drops linearly with operating frequency, i.e., 
their power consumption drops quadratically with operating frequency as opposed to the 
linear drop of conventional CMOS. The circuit techniques in these new families rely on 
constructing an explicitly reversible pipelined logic gate, where the information necessary 
to recover the energy used to compute a value is provided by computing its logical inverse. 
Information necessary to uncompute the inverse is available from the subsequent inverse 
logic stage. We demonstrate the low energy operation of SCRL by presenting the results 
from the testing of the first fully quasistatic 8x8 multiplier chip (SCRL-1) employing 
SCRL circuit techniques. 
Part I 



Introduction and Background 



1. Introduction 



1.1 New Ideas, Old Physics 

In principle, a computing engine need not dissipate any energy as shown in the work of 
Bennett, Feynman, and Landauer [3] [11] [27]. Although these authors approach the prob- 
lem from different disciplines and use different physical as well as theoretical models, they 
all conclude that the transfer of energy through a dissipative medium dissipates arbitrarily 
small amounts of energy if that transfer is made sufficiently slowly. 

In CMOS based circuits, the node voltages represent one of two logical values. Dur- 
ing operation, the logical values of these nodes, and their voltage levels, repeatedly toggle 
between the two valid levels. Given the capacitances of these nodes, we can view per- 
forming computation using CMOS circuits as moving charge from one node to another. In 
other words, to perform useful work in CMOS, we are continuously forced to place and 
remove charge from various nodes in the circuit. Charge transfer between nodes of differing 
potentials is similar to shuttling heat between two heat baths at differing temperatures. 

Thermodynamically, we know that all real energy transfer operations are invariably 
irreversible. Fortunately, however much of an idealization reversible processes are, they are 
still of extreme importance since it is possible in many situations to achieve them to a very 
high degree of approximation. 

As an example, it is possible to shuttle energy between two heat baths at different 
temperatures while losing arbitrarily small amounts of energy. This is done by inserting 
an infinite number of heat baths between the two original ones such that the temperature 
difference between any two adjacent baths becomes infinitesimally small. Shuttling heat 
packets reversibly between the baths at the two extremes following this gradual temperature 
staircase results in arbitrarily small energy loss. 

In general, if lossless reversibility is to be achieved, a process must be carried out at a 
slow enough rate so that in effect, the system is always in equilibrium. In this light, the 
reversible process may be regarded as a series of quasistatic changes along a sequence of 
neighboring equilibrium states. 

1.2 Quasistatic Switching 

Currently the power consumption of CMOS circuits drops linearly with lower operating 
frequency. This means that the energy consumed per cycle is constant since the cycle 
is inversely proportional to frequency. Typically in CMOS, each cycle contains the same 
amount of computational work and on the average the same amount of charge shuttling. 
This suggests that in conventional CMOS, the energy consumed per charge movement is 
always constant. This is analogous to the worst case scenario in our thermal example 
in which there were no additional intermediate thermal baths and the transfers between 



the two baths were done in one step. The reason for the high dynamic dissipation of 
conventional CMOS is the fact that charge transfer within them happens abruptly, i.e., 
not quasistatically. The time constant associated with charging a gate through a similar 
transistor is RC, where R is the ON resistance of the device and C its input capacitance. 
However, the cycle time can be, and usually is, much larger than RC . An obvious conclusion 
is that energy consumption can be reduced by spreading the transitions over the whole cycle 
thus making them closer to quasistatic processes rather than "squeezing" them all inside 
one RC . 
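
The thermal staircase of Section 1.1 has a simple electrical counterpart, which the following minimal sketch illustrates. It assumes that the load is charged in N equal voltage steps and that each step settles completely before the next one begins; the function name and the component values are illustrative assumptions, not taken from the text. Under those assumptions each step dissipates half of C times the step size squared, so the total loss falls as 1/N.

```python
def stepwise_charging_loss(c, vdd, n_steps):
    """Energy dissipated when charging c from 0 to vdd in n_steps equal steps.

    Assumes every step settles fully. Each step of size dv then dissipates
    0.5 * c * dv**2 in the series resistance, whatever its value.
    """
    dv = vdd / n_steps
    return n_steps * 0.5 * c * dv ** 2


C = 1e-12   # 1 pF load (illustrative value)
VDD = 5.0   # 5 V swing (illustrative value)

for n in (1, 10, 100, 1000):
    # n = 1 is the abrupt, conventional case; larger n approaches quasistatic.
    print(n, stepwise_charging_loss(C, VDD, n))
```

A single step reproduces the conventional loss of half C times Vdd squared per transition, while finer staircases approach the quasistatic limit.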

To asymptotically reduce the energy dissipation in CMOS all of the charge movements 
through the circuits must proceed quasistatically. To achieve this quasistatic operation, one 
has to guarantee absolute adherence to two conditions. The first is to guarantee that charge 
flow between any two nodes in the circuit occurs in a gradual and externally controlled 
manner. This means that we forbid any device in our circuit from turning on while there 
is a potential difference across it. It also means that once the device is turned on, the 
movement of charge through it must be done in a gradual and controlled manner so as 
to prevent a potential difference from developing. The second is to guarantee that the 
path followed by the charge does not contain any parts that violate quasistatic behavior. 
This means that the circuit should not contain any non-linear dissipative elements, e.g., 
diodes. Once these conditions are guaranteed, the dissipation could be set to a level or 
asymptotically reduced through external control of the rate of charge movement. This is 
true since the two conditions assure quasistatic energy transfer and it is only through that 
that asymptotic energy reduction is possible. We want to state here that there is no way 
to guarantee the two conditions stated above without employing reversible logic. 

In a CMOS circuit, we can always determine and control the potential on one side of 
a CMOS device since it is usually connected to a power supply rail. The potential on 
the other side, however, depends solely on the result of the computation. To perform a 
non-dissipative transition of the output, we must know the state of the output prior to 
and during this output transition. The reason for the need to know the previous state of 
a node before moving it is quite simple. Suppose that we needed to set the voltage on an 
internal node to Vdd. To do it quasistatically, we connect it to a rail that is currently at 
the same voltage, then we slowly ramp the voltage on the rail to Vdd, setting the voltage 
on the internal node to Vdd in the process. If the node was at GND, our rail would swing 
from GND to Vdd. In contrast, if the node was at Vdd, we would connect the node to a 
rail that is always anchored to Vdd. To determine which rail to connect to, we have to 
know the previous value as well as the desired final value of the node voltage. Furthermore, 
we have to hold on to this information throughout the transition. Stated more clearly, to 
non-dissipatively set the state of the output we must at all times have a copy of it. The 
only way out of this circle is to use reversible logic. 
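
The rail selection argument above can be sketched as a small lookup. This is only an illustration: the text describes the two cases in which the node ends at Vdd, and the two cases ending at GND are filled in here by symmetry as an assumption. The function name and the rail descriptions are hypothetical.

```python
def choose_rail(previous, target):
    """Pick the rail waveform that moves a node without a voltage step.

    previous, target: logical values, 0 for GND and 1 for Vdd.
    The (.., 0) entries are assumed by symmetry with the cases in the text.
    """
    rails = {
        (0, 1): "rail ramping GND -> Vdd",
        (1, 1): "rail held at Vdd",
        (1, 0): "rail ramping Vdd -> GND",
        (0, 0): "rail held at GND",
    }
    # The choice needs BOTH values: the target alone is not enough.
    return rails[(previous, target)]


for prev in (0, 1):
    for tgt in (0, 1):
        print(prev, tgt, choose_rail(prev, tgt))
```

The lookup key makes the point of the paragraph explicit: the same target value maps to different rails depending on the previous state, so that state must be retained throughout the transition.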

Recent and independent work by Hall [16] and Merkle [33] showed how to connect 
Retractile Cascade stages to eliminate the power dissipation in the latches that were used to 
hold on to past values. These two proposals are worthy of note since they are the only ones 
that conform to the two conditions of quasistatic operation outlined above. Both 
proposals, however, were rather sketchy when it came to the details of the physical 
implementation of their proposed logic. 



1.3 What is Reversible Logic 

Reversible logic is a way to perform computation where information entropy is strictly 
conserved, or in some implementations mostly conserved. In our context we define constant 
information entropy as always keeping enough information around to be able to accurately 
retrace all the events, or steps, of the past. The main drive for implementing reversible logic 
is that of drastically reducing the energy dissipation of computing circuits. As is widely es- 
tablished by now, information entropy and thermodynamic entropy are linked. One cannot 
increase information entropy without dissipating energy. Fortunately the reverse is just as 
true. That is, if information entropy is not increased, it is theoretically possible, and as 
shown here practical, to perform computation while dissipating asymptotically vanishing 
amounts of energy if the computation is carried out asymptotically slowly. A trivial way 
of not increasing the information entropy of the system would be to store a copy of every 
transaction of the system forever. This obviously requires an infinite amount of storage 
space. A more practical approach to maintaining constant information entropy would be 
to undo, or reverse, the effect of a computation once the results of that computation had 
been utilized, i.e., to perform reversible logic operations. To resign oneself to performing 
computation reversibly is to empower oneself to perform it while dissipating orders of 
magnitude less energy than would be dissipated by conventional methods. This is because 
reversible computation provides us with the needed information to make the correct con- 
nection to the restoring swinging rail and thus allows us to restore nodes quasistatically. 
We define the term asymptotic energy reduction as the ability to perform a computation 
while consuming asymptotically less and less energy as the computation is performed slower 
and slower with no theoretical limit on how small the consumption can get. If there is 
a limit, we do not consider that operation to be asymptotic in energy reduction. In other 
words, the theoretical line for energy dissipation associated with the process must in the 
limit reach zero. 

It is the fact that dissipation could be reduced by orders of magnitude using reversible 
logic that convinced us of the need for it. We note here that reversible logic is less restric- 
tive than conservative logic that was proposed by Fredkin and Toffoli in [13]. In addition 
to preserving the information content, conservative logic also preserves the total number of 
ones and zeros of the system. As it turns out, conservative logic simplifies some mechanical 
implementations of reversible logic, due to its dual polarity signaling. However, only 
reversibility is needed for reducing the energy consumption. In a sense our implementa- 
tions share their reversible aspect with the reversible and conservative Fredkin gate. Our 
implementations however differ in their lack of conservation for ones and zeros. 
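
The distinction between reversible and conservative logic can be checked by brute force on three-input gates. The sketch below is illustrative only: the Fredkin gate is the controlled swap mentioned above, while the Toffoli gate (a controlled-controlled NOT) is not discussed in the text and is brought in here purely as a well-known example of a gate that is reversible but not conservative. The helper names are hypothetical.

```python
from itertools import product


def fredkin(c, a, b):
    """Controlled swap: a and b are exchanged when c is 1."""
    return (c, b, a) if c else (c, a, b)


def toffoli(a, b, t):
    """Controlled-controlled NOT: t is inverted when a and b are both 1."""
    return (a, b, t ^ (a & b))


def is_reversible(gate):
    """True if the gate maps the 8 input patterns onto 8 distinct outputs."""
    inputs = list(product((0, 1), repeat=3))
    return len({gate(*x) for x in inputs}) == len(inputs)


def is_conservative(gate):
    """True if the gate preserves the number of ones for every input."""
    return all(sum(gate(*x)) == sum(x) for x in product((0, 1), repeat=3))


for name, gate in (("fredkin", fredkin), ("toffoli", toffoli)):
    print(name, is_reversible(gate), is_conservative(gate))
```

Both gates preserve enough information to retrace their inputs, but only the Fredkin gate also preserves the count of ones, which is the extra restriction that the text says is not needed for reducing energy consumption.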

Intuitively, and in contrast to the above, a circuit that has no means of computing the 
logical inverses of its functions has no means of preserving the information content of its 
nodes and hence should not be thought of as reversible logic. 

1.4 Temporal Reversibility 

A number of other proposals sought to lower the energy dissipation by charging and 
discharging the internal nodes of their circuits in a gradual and controlled manner. In their 



work, Koller and Athas [26] were largely concerned with reducing the energy consumption 
of bus drivers. Since in their circuits, the input to the bus driver was held stable during the 
SETting and RESETting of the gate, reversibility was in effect performed temporally and 
hence they were able to achieve asymptotic energy reduction for their bus driver without 
needing reversible logic. The key here is that the circuit holding the input stable during 
SETting and RESETting of the rails provided the "infinite storage space" within which 
the history of the computation is recorded. Koller and Athas, however, correctly recognized 
and reported that when the entire circuit was considered, there was an unavoidable energy 
dissipation in the pipeline registers. This of course was a consequence of not being able to 
satisfy the first condition in the pipeline registers due to the absence of reversible logic. We 
do acknowledge that achieving asymptotic energy reduction in selective parts of the circuit 
as Koller and Athas have demonstrated could go a long way towards reducing the power 
consumption of a system. We note, however, that the reduced energy consumption would 
have a lower bound. 

1.5 Diode Based Proposals 

Previously, we stated the importance of knowing the previous value of a node while we 
are changing it in order to achieve quasistatic switching of that node. The previous value 
was needed so that we can make the correct decision of which rail to connect the node to in 
order to effect quasistatic charge transfer. A simple diode can however correctly make that 
decision for us. Recently two separate proposals have been put forward that achieve energy 
reduction in CMOS circuits through using diodes. The first is by Denker et al. [9] while the 
second is by Hinman and Schlecht [17]. Using a diode however violates the second of the 
two conditions we have stated above. The fact is that charge transfer in a forward biased 
diode is not a quasistatic process even if the charge transfer proceeds slowly. 

This becomes evident when we trace the path of an electron through the diode. In a 
forward biased diode, the energy of electrons on the N side of the P-N junction is elevated 
by the forward voltage to a level that permits them to overcome the built-in potential 
barrier and diffuse over to the P side of the junction. Once on the P side, they rapidly fall 
down the potential hill to equalize with the energy of the electrons on the P side. It is this 
rapid and uncontrollable falling down the potential hill that is not quasistatic in the P-N 
junction. The fact that the "height" of the fall is always constant irrespective of the rate at 
which charge is allowed to flow through the diode is the reason for the constant and non-linear 
value of the diode's Vbe. From this we see that charging a node with a capacitance C 
through a diode dissipates an amount of energy, Ediode, that is equal to C Vdd Vbe, where Vbe 
is the forward potential drop of the diode. At best, diode based methods can reduce the 
energy consumption by a factor equal to Vdd/Vbe. This puts a limit on the energy reduction 
ratio over conventional CMOS circuits, usually no more than 10. Further, with the current 
technology push to reduce Vdd for CMOS circuits, in some cases below 200mV [4], the 
energy saving factor of diode based circuits will only get smaller or disappear entirely. 
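
As a rough numerical illustration of this bound, the sketch below evaluates the diode loss C Vdd Vbe and the best-case saving factor Vdd/Vbe for a few supply voltages. The forward drop of 0.6 V is an assumed typical value, not a figure from the text, and the function names are hypothetical.

```python
def diode_charging_loss(c, vdd, vbe=0.6):
    """Energy lost charging c to vdd through a diode: C * Vdd * Vbe.

    vbe = 0.6 V is an assumed forward drop, used only for illustration.
    """
    return c * vdd * vbe


def diode_saving_factor(vdd, vbe=0.6):
    """Best-case saving factor over conventional CMOS: Vdd / Vbe."""
    return vdd / vbe


for vdd in (5.0, 3.3, 1.0, 0.2):
    # The factor shrinks with the supply voltage and drops below 1
    # once Vdd is smaller than the assumed forward drop.
    print(vdd, diode_saving_factor(vdd))
```

With the assumed drop, a 5 V supply gives a factor below 10, and a 200 mV supply gives a factor below 1, which is the case in which the saving disappears entirely.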



1.6 Charge Sharing versus Quasistatic Operation 

Some of the proposed low energy circuits techniques suffer from allowing charge sharing 
among the internal nodes, which violates the first condition for quasistatic operation. 

I stress here that the construction of quasistatic circuits should at all costs avoid charge 
sharing among the internal nodes since with charge sharing, the energy saving over conventional 
CMOS could easily be wiped out for all but the simplest of circuits. The point behind 
quasistatic switching is to recover most of the energy that at one point was deposited on 
the internal nodes of the circuit. Except for simple circuits, the effective capacitance of the 
internal nodes of a circuit becomes comparable to the capacitance of the output node. To 
allow charge sharing to occur on the internal nodes, is to allow a large and undetermined 
amount of the stored energy to be wasted. This could be as high as 50% of the total charge 
leading to a poor energy saving factor of less than 2. Furthermore, the indeterminacy of 
energy loss reflects itself in a variable effective capacitance of the circuit, thus nullifying the 
advantage of constant supply capacitance that is provided by dual polarity designs. 
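
One way to see how a loss of this size can arise is the textbook case of two capacitors that are abruptly connected. The sketch below is an illustration under assumptions that are not in the text: two ideal, linear capacitors, one charged to Vdd and one empty, joined through an arbitrary resistance. The function names and component values are hypothetical.

```python
def stored_energy(c, v):
    """Energy stored on a linear capacitor: 0.5 * C * V^2."""
    return 0.5 * c * v ** 2


def charge_sharing_loss(c1, v1, c2, v2):
    """Energy dissipated when two capacitors are abruptly connected.

    Follows from charge conservation; the result does not depend on the
    resistance of the connecting path.
    """
    return 0.5 * (c1 * c2 / (c1 + c2)) * (v1 - v2) ** 2


C = 1e-12   # illustrative node capacitance
VDD = 5.0   # illustrative supply voltage

loss = charge_sharing_loss(C, VDD, C, 0.0)
fraction = loss / stored_energy(C, VDD)
print(fraction)
```

For equal capacitances the fraction of the stored energy that is lost is one half, which is consistent with the poor saving factor quoted above when internal nodes are comparable in capacitance to the output node.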

1.7 Logic Families and Universality 

For a circuit technique to be classified as a computing logic family, it must be universal. 
This means that it should contain at least one member that is non-monotonic and it must 
support negation. This is usually overlooked by new proposals for low energy circuits. As 
we have illustrated in a paper published earlier [41] and will revisit in Section 3.2.1, 
initial attempts at quasistatic circuits usually need augmentation in order to support logical 
negation and hence be eligible to be considered a logic family. In practice new proposals that 
do not attempt to construct multiple stage pipelines with stages that are more complex 
than simple buffers or inverters do not detect the absence of universality in their "logic 
families". A powerful check to see if a new computing circuit technique is universal is to 
try and design a circuit that takes in a logical value at its input and that is able at a later 
stage in the pipeline to produce both the true and the complement copies of that input. 
Furthermore the circuit must be able to have them arrive simultaneously at a given stage 
in the pipeline. The power of this test is that in quasistatic circuits, one cannot insert 
an inverter inline with a signal to get its complement, as is frequently done in CMOS. An 
inserted gate would also have to be a controlled pipeline stage and that forces the proposed 
circuit technique to fail the test I have outlined above. Experience will show that proposed 
logic families passing this test are universal. 

The subject of this work is to try and apply the principles of reversible logic to CMOS 
circuits to achieve full quasistatic operation throughout the system and thus significantly 
reduce its energy consumption. 

1.8 Contributions of this Work 

In this document I present a number of new techniques for constructing non-dissipative 
quasistatic CMOS circuits. We feel that these techniques have a number of distinct advan- 
tages which warrant their use in future circuits. 



In Section 2, I start by examining the ways in which the energy consumed for each 
charge transfer in CMOS circuits is made arbitrarily small as the process proceeds qua- 
sistatically. Initially, I will start by examining the energy dissipation mechanisms in CMOS 
circuits. I will then attempt to analyze them more closely to identify ways in which charge 
transfer, and hence computation, can be done quasistatically. The discussion will include 
the dissipation in the computing circuits as well as the dissipation caused by the action of 
the semiconductor switches in the power supplies of the system. 

In Section 3, I describe a number of early implementations of Charge Recovery Logic 
(CRL) circuits. The common property of all of these circuits is their ability to do com- 
putations quasistatically thus consuming arbitrarily small amounts of energy when clocked 
sufficiently slowly. The discussion will include both Fully-Symmetric CRL as well as 
N-Channel CRL. It will also show how to string multiple CRL gates in a non-retracting 
pipeline [41]. Both Fully-Symmetric and N-Channel CRL were abandoned however in favor 
of our more recently discovered and much improved form of CRL which is examined in 
Section 4. 

In Section 4, I present a much improved family of CRL called Split-Level CRL (SCRL). 
This form uses 2 times as many devices as conventional CMOS, requires only one wire for 
every signal, and actively drives all outputs during sampling. Further, I will show how to 
construct Split-Level CRL circuits using only 2 external inductors for every chip. SCRL 
serves as the cornerstone of our research since we based our demonstration chip design on 
its techniques. 

Conceptually, Split-Level CRL differs from earlier CRL in two ways. The first is the use 
of Split-Level voltages. The second is the elimination of the RESET devices and delegating 
the action of restoring the voltage on SET nodes to gates in the reverse pipeline. As in 
the previous Section, the discussion will include how to string multiple SCRL gates in a 
non-retracting pipeline [42]. 

The circuit techniques in these new CRL and SCRL logic families rely on constructing 
an explicitly reversible pipelined logic gate, where the information necessary to recover the 
energy used to compute a value is provided by computing its logical inverse. Information 
necessary to uncompute the inverse is available from the subsequent inverse logic stage. 

To verify the quasistatic operation and behavior of Split-Level CRL, we have fabricated 
and tested an 8x8 CMOS multiplier chip, labeled SCRL-1, that employed the circuit 
techniques of Split-Level Charge Recovery Logic. In Section 5, I describe the design of this 
demonstration chip. Following that, in Section 6, I describe the measurement techniques 
and their results that verified the lower energy consumption of SCRL-1 as a consequence 
of quasistatic operation through reversibility. 

To my knowledge, SCRL-1 is the first working implementation of a pipelined, reversible 
logic based, asymptotically zero energy circuit. As such, I am certain that there is a lot 
more for us to discover and refine than what we have reported so far. In Section 7, I try 
to give some suggestions about directions of future work that could further improve the 
applicability of SCRL. These include CAD and architecture issues, as well as the design of 
better quasistatic power supply switches. My opinion is that as far as quasistatic reversible 
computation is concerned, we've only just begun... 



2. Quasistatic Switching in CMOS 



2.1 Introduction 

The subject of this section is to show ways in which the energy consumed per charge 
transfer is made arbitrarily small as the process proceeds quasistatically. 

Initially, we will start by examining the energy dissipation mechanisms in CMOS cir- 
cuits. We will then attempt to analyze them more closely to identify ways in which charge 
transfer, and hence computation, can be done quasistatically. 

2.2 Energy Dissipation in CMOS 

As is widely known, the internal energy dissipation of conventional CMOS circuits is 
attributable to three major components. The first is due to the static leakage currents 
between the terminals of MOS devices. The second is due to the brief short between Vdd 
and Vss during switching which is caused by both N-Channel and P-Channel devices being 
simultaneously ON for a brief time during a swing of Vdd > 2VT. The third is due to the 
transient current associated with charging and discharging the gate capacitance C through 
a device with ON resistance R. 

2.2.1 Dissipation Due to Leakage 

Dissipative leakage currents occur anytime circuit nodes at differing potentials are sep- 
arated by slightly conductive mediums. Such dissipative "sandwiching" is present in a 
number of locations in CMOS circuits. These locations can be grouped according to their 
leakage mechanisms into two groups. The first is the leakage due to the reverse current of 
PN junctions. The second is the subthreshold conduction current between the source and 
drain of any MOS device. 

The reverse current of a PN junction depends exponentially on temperature. Hence 
operating at lower temperatures greatly lowers this form of dissipation. In addition, a 
number of reverse-biased junctions that are currently used for device isolation will become 
unnecessary with the advent of silicon-on-insulator fabrication technology (SOI). 

The case for subthreshold conduction is more complicated. Increasing the threshold 
voltage of the devices in the circuit reduces subthreshold conduction, thus lowering quiescent 
power consumption. Unfortunately, this increases the ON resistance of the devices as the 
difference between VT and the supply voltages decreases. To maintain the same speed 
performance, the devices must be made wider leading to higher dynamic dissipation. We 
will discuss dynamic dissipation in Section 2.2.3. Here, we stress the fact that in the 
case that subthreshold conduction becomes appreciable, we can trade some of it for higher 
dynamic dissipation. Having done that, this thesis will show how to significantly lower 
dynamic dissipation of CMOS circuits leading to lower overall energy dissipation. 

Figure 2.1: Energy analysis model of CMOS circuits. 

2.2.2 Dissipation due to Vdd to Vss Shorting 

During switching, a short occurs between Vdd and Vss caused by both N-Channel and 
P-Channel devices being simultaneously ON for a brief time for swings where Vdd > 2VT [5]. 
We can approximate this switching dissipation by 

T r 1 Vdd Vdd - 2Vj „ , , 

Vdd x -— x — ^— x 2RC (2.1) 

which for a typical Vdd = 4VT simplifies to a constant multiple of $CV_{dd}^2$. As Vdd drops 
below 2VT this switching dissipation becomes negligible. Unfortunately, the transfer curve 
begins to exhibit hysteresis, thus limiting the utility of the gate at these power supply 
voltages. In addition, operating with Vdd < 2VT increases subthreshold conduction and 
increases propagation delays. 

In Sections 3 and 4 we will see how this form of dissipation is completely eliminated in 
the family of CMOS circuits that we are proposing in this work. 

2.2.3 Dynamic Dissipation 

It is widely known that the energy dissipation that is associated with charging and 
discharging the node capacitance, C, of a CMOS gate through a device of ON resistance 
R is equal to $CV_{dd}^2$ per period for a rail-to-rail voltage equal to Vdd. We will rederive this 
result to gain an insight into the real reason for this dissipation. We use as our model the 
familiar circuit in Figure 2.1. The C in the figure is the capacitance of the driven node 
while the R is the ON resistance of the driving gate. We start with the capacitor voltage 
at zero. Using the voltage source we apply a step voltage of Vdd. This mimics the action of 
turning on a MOSFET that drives the gate of another. The current in this circuit follows 
$i(t) = \frac{V_{dd}}{R} e^{-t/RC}$ and hence the power consumed is $P = \frac{V_{dd}^2}{R} e^{-2t/RC}$. Integrating over the 
charging time we get 

$$E_{\text{dynamic}} = \frac{V_{dd}^2}{R} \int_{t=0}^{T/2 \to \infty} e^{-2t/RC}\,dt = \frac{CV_{dd}^2}{2} \qquad (2.2)$$

From the above we see that in order to charge the load to Vdd we need $CV_{dd}^2$ Joules of 
energy. Half of it is dissipated in the resistor during charging, and the other half is stored 
in the charged capacitor. The latter part is not lost yet but would be if we discharge the 
capacitor in a similar fashion, i.e., by a voltage step in the voltage source. 
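
The result can be checked numerically. The following minimal sketch integrates the power dissipated in R for the step response of the series RC model; the function name, the component values, and the choice of a midpoint rule over 20 time constants are illustrative assumptions.

```python
import math


def step_charging_energies(r, c, vdd, n_steps=20_000, n_tau=20.0):
    """Return (energy dissipated in R, energy stored in C) for a voltage step.

    Uses the closed-form current i(t) = (vdd / r) * exp(-t / (r * c)) and a
    midpoint rule over n_tau time constants to integrate i(t)^2 * R.
    """
    tau = r * c
    dt = n_tau * tau / n_steps
    dissipated = 0.0
    for k in range(n_steps):
        t = (k + 0.5) * dt
        i = (vdd / r) * math.exp(-t / tau)
        dissipated += i * i * r * dt
    stored = 0.5 * c * vdd ** 2
    return dissipated, stored


for r in (1e3, 1e4):
    # Changing R changes how fast the node charges, not how much is lost.
    print(r, step_charging_energies(r, 1e-12, 5.0))
```

The dissipated energy matches the stored energy for both resistances, which illustrates that with a voltage step the loss is set by C and Vdd alone and cannot be reduced by making the switch more conductive.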
2.3 Quasistatic Charge Transfer 

In essence the first two dissipation mechanisms discussed, leakage and Vdd-to-Vss short, 
are related to the particular implementation of CMOS circuits. In general dissipation due 
to leakage is considerably smaller than dissipation by the other means. In addition, it 
could be made even smaller if needed, e.g., operating at lower temperature or using SOI. 
Furthermore, we will illustrate how to eliminate the occurrence of Vdd-to-Vss shorting using 
a number of circuit topologies in Section 3 and Section 4. Elimination of the dissipation 
associated with the repeated charging and discharging of internal nodes, dynamic dissipa- 
tion, is more complicated. This is because it relates directly to the movement of energy 
packets between nodes at different potentials, which is governed and limited by the laws of 
thermodynamics. It is also by far the dominant dissipative mechanism at typical operating 
frequencies. For this reason, we conclude that to achieve non-dissipative computation we 
must direct our efforts to reducing dynamic dissipation. 

At this point we would like to emphasize that the $CV_{dd}^2$ dissipation is a direct consequence of 
the way we perform the cycling of the load C and is not an irreducible minimum associated 
with charging and discharging a capacitive node. Charging a node to Vdd from 0 in a period 
T only requires a current, i(t), such that 

$$\int_{t=0}^{T} i(t)\,dt = Q = CV_{dd} \qquad (2.3)$$

However, the energy dissipation is related to the integral of the square of the current 

$$E = R \int_{t=0}^{T} i^2(t)\,dt \qquad (2.4)$$

It therefore follows that minimum dissipation results if we charge the load using a current 
function i(t) that minimizes the integral in Equation 2.4 while obeying Equation 2.3. 
Intuitively we can see that this minimum function is none other than 

$$i(t) = Q/T = CV_{dd}/T \qquad (2.5)$$

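To prove that we add the perturbation v(t) to Q/T such that i(t) becomes equal to 
(Q/T) + v(t). Using Equation 2.4 we calculate the energy dissipation as 

$$E = R\left[\int_{t=0}^{T}\left(\frac{Q}{T}\right)^2 dt + \frac{2Q}{T}\int_{t=0}^{T} v(t)\,dt + \int_{t=0}^{T} v^2(t)\,dt\right] \qquad (2.6)$$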




For i(t) to satisfy Equation 2.3, the second integral in Equation 2.6 must equal zero. 
With $v^2(t)$ never negative, we see that any perturbation of i(t) = Q/T leads to increased 
dissipation. From the above we see that minimum energy dissipation results when the step 
function of the voltage source is replaced with a current source with a step of I = CVdd/T 
and which is turned ON for T seconds. This should not come as a surprise since as T 
gets larger, the process of charging and discharging the load with constant I becomes more 
quasistatic. 
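
The perturbation argument can be illustrated numerically. The sketch below compares the constant current Q/T with a current that carries the same total charge but has a sinusoidal perturbation added. The perturbation shape, its amplitude of half of Q/T, the component values, and the function names are all illustrative assumptions.

```python
import math


def dissipation(current, r, t_total, n=10_000):
    """Midpoint-rule estimate of R * integral of i(t)^2 over [0, t_total]."""
    dt = t_total / n
    return r * sum(current((k + 0.5) * dt) ** 2 for k in range(n)) * dt


R, C, VDD, T = 1e3, 1e-12, 5.0, 1e-6   # illustrative values, T >> RC
Q = C * VDD


def constant(t):
    """The minimizing current of Equation 2.5."""
    return Q / T


def perturbed(t):
    """Same total charge: the sine integrates to zero over one full period T."""
    return Q / T + 0.5 * (Q / T) * math.sin(2 * math.pi * t / T)


e_const = dissipation(constant, R, T)
e_pert = dissipation(perturbed, R, T)
print(e_const, e_pert, e_pert / e_const)
```

Because the cross term integrates to zero, the extra loss comes only from the squared perturbation, so the perturbed current dissipates more even though it delivers exactly the same charge in the same time.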
2.3.1 Energy Dissipation with Current Sources 

In the previous section we stated that energy dissipation is minimized by using current 
steps instead of voltage steps. In this section we will calculate this minimum. The energy 
dissipated in charging the capacitor using a current step I for T seconds is 

$$E = I^2 R \times T = CV_{dd}^2 \times \frac{RC}{T} \qquad (2.7)$$



Therefore the energy dissipated in one cycle equals 

$$E_{\text{current source}} = CV_{dd}^2 \times \frac{2RC}{T} \qquad (2.8)$$



We see that with a current source, the energy dissipated is less than that with a voltage 
step by a factor of 2RC/T for T > 2RC. Since the minimum period for one cycle is 2T, the 
operating frequency is f = 1/(2T) and the dissipation with a current source per cycle is 
$CV_{dd}^2 \times 4RCf$. Hence the power dissipation of this circuit becomes 

$$P_{\text{current source}} = CV_{dd}^2 \times 4RCf^2 \qquad (2.9)$$

The quadratic dependence of the power on frequency is in sharp contrast to the familiar 
linear dependence in conventional CMOS, $P = CV_{dd}^2 f$. Figure 2.2 plots the power dissipation 
associated with the voltage step, solid line, and that of the current step, dashed line, 
as a function of operating frequency. Depending on the operating frequency, constant current 
cycling of capacitive loads results in orders of magnitude less power consumption when 
compared to the constant voltage cycling method. We note here that quadratic dependence of 
power on frequency leads to linear dependence on frequency of the energy consumed per 
operation. This gives rise to the possibility of performing computation while consuming 
asymptotically zero energy. 
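
The two scaling laws can be compared directly. The sketch below evaluates the conventional power and the power of Equation 2.9 at frequencies below 1/(4RC), the point at which T = 2RC and the two expressions coincide. The component values and function names are illustrative assumptions.

```python
def power_voltage_step(c, vdd, f):
    """Conventional CMOS: P = C * Vdd^2 * f."""
    return c * vdd ** 2 * f


def power_current_source(r, c, vdd, f):
    """Constant-current cycling, Equation 2.9: P = C * Vdd^2 * 4RC * f^2."""
    return c * vdd ** 2 * 4 * r * c * f ** 2


R, C, VDD = 1e3, 1e-12, 5.0          # illustrative values
F_MAX = 1.0 / (4 * R * C)            # frequency at which T = 2RC

for divisor in (1, 10, 100, 1000):
    f = F_MAX / divisor
    p_v = power_voltage_step(C, VDD, f)
    p_i = power_current_source(R, C, VDD, f)
    # Energy per cycle is P / f: constant for the step, linear in f otherwise.
    print(f, p_v, p_i, p_v / p_i)
```

Each tenfold reduction in frequency lowers the conventional power tenfold and the constant-current power a hundredfold, so the saving ratio grows in proportion to how far the circuit is slowed below 1/(4RC).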

2.3.2 Multiple Capacitive Loads 

The above analysis applied to a single capacitive node. To be useful we must generalize 
the analysis to a number of simultaneously switching nodes as is the case in actual CMOS 
circuits. Ideally, each RC circuit that models a CMOS gate driving a capacitive load would 
have its own separate current source as shown in Figure 2.3. This is not practical with 
current technology. 

Instead, we use the circuit topology in Figure 2.4. Here a single current source provides
the current that is needed by all the loads. If the RC time constant is the same for all
the branches of the circuit, then the voltages on all the capacitors rise in unison, and
the circuit simplifies to one with a single equivalent capacitor equal to the sum
of all the capacitors in the circuit and a single resistor equal to the parallel combination
of the resistors in the circuit. The magnitude of the current in Figure 2.4 is then equal to
the magnitude of the sum of the currents in Figure 2.3.
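
The reduction to a single equivalent RC branch can be sketched as follows. This is a minimal illustration, assuming every branch shares the same time constant; the helper name `equivalent_rc` and the branch values are hypothetical.

```python
def equivalent_rc(branches):
    """Collapse parallel series-RC branches into one (R_eq, C_eq) pair.

    `branches` is a list of (R, C) tuples. The reduction is only exact
    when all branches have the same R*C product.
    """
    c_eq = sum(c for _, c in branches)            # capacitors add
    r_eq = 1.0 / sum(1.0 / r for r, _ in branches)  # resistors combine in parallel
    return r_eq, c_eq


# Three hypothetical branches that all share the time constant TAU.
TAU = 1e-9
branches = [(TAU / c, c) for c in (1e-12, 2e-12, 5e-12)]
r_eq, c_eq = equivalent_rc(branches)
print(f"R_eq = {r_eq:.4g} ohm, C_eq = {c_eq:.4g} F, R_eq*C_eq = {r_eq * c_eq:.4g} s")
```

The equivalent branch keeps the common time constant, which is why all the capacitor voltages can be treated as a single node voltage in this case.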

In practice, however, the RC time constants of the separate branches are different. With
a common source, the circuit branches with faster time constants will track the voltage of
the current source more closely than branches with slower RC time constants. The slower
branches therefore lag behind the fast ones, and the voltages on the loads of different
branches grow apart. As a result, the point at which we must turn off the current source is
not well defined. Turning the source off when the fast branches reach Vdd leaves the slow
branches incompletely charged, and in time the loads in the slow branches will pull the
voltages of all the branches to some voltage below Vdd. Waiting for the slow branches to
charge completely results in voltages larger than Vdd on the fast branches, making the
circuit susceptible to latch-up and/or breakdown in most VLSI technologies.

Figure 2.2: Normalized power dissipation plot for a voltage step, solid line, and for a current
step, dashed line. Horizontal axis: frequency in units of 1/(4RC) Hz.

Figure 2.3: Model of CMOS circuit with separate driving current sources.

Figure 2.4: Model of CMOS circuit with common driving current source.

Figure 2.5: Voltage waveform at the terminals of a constant current source driving a
capacitive load C.

Provided that the capacitive loads are linear, the common current source could be
replaced with a voltage source that has the waveform shown in Figure 2.5. The initial and
final jumps in the voltage of this source are equal to VddRC/T. We observe that for T >> RC
these jumps become negligible, and the current source could be replaced by a voltage source
that outputs a linear ramp such that

$$V(t) = \begin{cases} tV_{dd}/T, & \text{for } 0 < t < T \\ V_{dd}, & \text{for } t > T \end{cases} \qquad (2.10)$$



With this source, the current at the start of the ramp is not constant but builds
exponentially to the desired constant level with the RC time constant. The current also
drops off exponentially as soon as the voltage of the source levels off at Vdd. Given enough
time, the voltage on all the nodes reaches Vdd regardless of their branch time constant. For
these reasons we choose this source as a good compromise for approximating the desired
ideal current source driver.
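
The behavior of branches with different time constants under a common voltage ramp can be sketched with the closed-form response of a series RC branch. This is a minimal illustration, assuming ideal linear components; the function name, the normalized units, and the two time constants are hypothetical choices.

```python
import math


def ramp_response(t, tau, t_ramp, vdd=1.0):
    """Capacitor voltage of a series RC branch driven by a linear ramp.

    The source rises linearly from 0 to vdd over t_ramp and then holds vdd.
    tau is the branch time constant R*C.
    """
    slope = vdd / t_ramp
    if t <= t_ramp:
        # Solution of tau*dv/dt + v = slope*t with v(0) = 0.
        return slope * (t - tau + tau * math.exp(-t / tau))
    # After the ramp ends, the remaining gap to vdd decays exponentially.
    v_end = slope * (t_ramp - tau + tau * math.exp(-t_ramp / tau))
    return vdd - (vdd - v_end) * math.exp(-(t - t_ramp) / tau)


T_RAMP = 100.0  # ramp time in arbitrary units
for tau in (1.0, 10.0):  # a fast and a slow branch (hypothetical values)
    at_end = ramp_response(T_RAMP, tau, T_RAMP)
    settled = ramp_response(T_RAMP + 20.0 * tau, tau, T_RAMP)
    print(f"tau = {tau:4.1f}: V at end of ramp = {at_end:.4f}, after 20 tau = {settled:.6f}")
```

In this sketch the slow branch lags the fast one at the end of the ramp, yet both settle at Vdd without overshoot once the source holds its final value.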

2.3.3 Energy Dissipation with Voltage Ramps 

In the previous section we saw how a voltage ramp can approximate a current source.
Here we will examine the added energy dissipation of a voltage ramp resulting from its
divergence from the ideal current source at the end of the ramp. For a voltage ramp that
follows Equation 2.10, the current i(t) is equal to

$$i(t) = \begin{cases} I(1 - e^{-t/RC}), & \text{for } 0 < t < T \\ I(e^{-(t-T)/RC} - e^{-t/RC}), & \text{for } t > T \end{cases} \qquad (2.11)$$

where I = VddC/T. To get the dissipation associated with charging a load with a ramp we
integrate the power dissipated by the above current into the resistor R over the charging
period. We now stress the following two observations. The first is that, unlike the case
with current sources, load charging with voltage ramps continues after the ramp reaches Vdd.
For this reason, a voltage ramp must have a higher slope than the one in Equation 2.10 in
order to yield the same effective charging time as a current source with I = CVdd/T. The
second observation is that, carrying the time of integration until all transients have
settled, i.e., to t = ∞, for the voltage ramp in Equation 2.10, we find that for T >> RC the
dissipated energy is essentially the same as for the current source, i.e.,
$E \approx CV_{dd}^2 RC/T$.
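
The second observation can be checked by integrating i(t)^2 R numerically. The sketch below assumes the ramp current given above for a series RC branch; the closed form in `ramp_energy_closed` is our own result from carrying out that integral, not a formula quoted from the text, and the component values are hypothetical.

```python
import math


def ramp_current(t, r, c, vdd, t_ramp):
    """Current into a series RC branch driven by a linear ramp of duration t_ramp."""
    tau = r * c
    i0 = c * vdd / t_ramp
    if t <= t_ramp:
        return i0 * (1.0 - math.exp(-t / tau))
    return i0 * (math.exp(-(t - t_ramp) / tau) - math.exp(-t / tau))


def ramp_energy_numeric(r, c, vdd, t_ramp, settle_taus=40.0, steps=100_000):
    """Trapezoidal integral of i^2 * R from 0 until the transients have settled."""
    t_end = t_ramp + settle_taus * r * c
    dt = t_end / steps
    total = 0.0
    prev = r * ramp_current(0.0, r, c, vdd, t_ramp) ** 2
    for k in range(1, steps + 1):
        cur = r * ramp_current(k * dt, r, c, vdd, t_ramp) ** 2
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


def ramp_energy_closed(r, c, vdd, t_ramp):
    """Closed form of the same integral taken to t = infinity (our derivation)."""
    tau = r * c
    x = tau / t_ramp
    return c * vdd**2 * x * (1.0 - x * (1.0 - math.exp(-t_ramp / tau)))


R, C, VDD = 1e3, 1e-12, 5.0  # hypothetical values, RC = 1 ns
T = 100e-9                   # ramp time of 100 RC
ideal = C * VDD**2 * (R * C) / T  # constant-current result
print(ramp_energy_numeric(R, C, VDD, T) / ideal, ramp_energy_closed(R, C, VDD, T) / ideal)
```

Under these assumptions the ramp dissipation approaches the constant-current value as T grows relative to RC, differing from it by a fraction of roughly RC/T.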

From the above observations we see that the energy lost using a ramp equivalent to the
current source, $E_{ramp}$, is

$$E_{ramp} = CV_{dd}^2 \left( \frac{2RC}{T - nRC} \right) \qquad (2.12)$$

where n is the number of RC time constants needed after the end of the ramp for the
voltages on the nodes to reach Vdd within acceptable tolerance.

As expected, the energy dissipation of the voltage ramp rapidly approaches that of an
ideal current source, which is the minimum possible, as T becomes much larger than RC.

2.4 Ramp Generators 

We have seen how the energy dissipated in charging and discharging a capacitive load is
directly proportional to the slope of the input ramp. In truth, however, using a ramp does
not reduce the power consumption of the whole system; it merely relocates it. If we are not
careful about the design of the ramp generator, it is possible to dissipate in the generator
much more energy than that saved in the circuit. Simple ramp generators continuously
vary the conduction ratio of the pull-up and pull-down devices in their driving stage to
produce the required intermediate voltage values of the ramp. The ramp produced will
lower the energy dissipated in the circuit it drives, but the constant current path through
the pull-up and pull-down devices will waste much more energy.

Note that merely relocating the dissipation from the computing circuit to the supplies
still has some advantages. One advantage is to increase packaging density in systems
that are otherwise limited by heat removal constraints. In conventional computing circuits,
electrical energy is supplied to the circuit by copper wires and waste heat is removed
by other media, such as forced air or circulating refrigerants. Since copper wires can
transfer energy with much higher power densities than other media, dense packaging
in supercomputers is usually limited by energy removal constraints, not by energy supply
constraints. Using quasistatic computing elements, even those with dissipative ramp
generators, the same copper wire injecting energy into a dense package is the one used
to remove a high percentage of it out of the package, thus greatly simplifying cooling and
increasing density. Another advantage is that we might be able to use more exotic devices
or technologies in the power supplies than in our circuits. This is because power supplies
have fewer components, and hence their components do not need to be densely packaged or
produced cheaply in large numbers like circuit components.

Figure 2.6: The two places of system dissipation.

If the concern is to lower the overall energy dissipation of the system however, then we 
have to consider the dissipation occurring in the ramp generators. Currently there are two 
ways of building ramp generators with little dissipation. The first approximates a ramp 
by generating a sinusoidal waveform. The second approximates the ramp with a stepping 
staircase waveform. 

In the next sections we will examine the dissipation associated with using either of the
two methods. In our examination, we will distinguish between the dissipation that happens
in the computing circuits and the dissipation that occurs within the ramp generator. In
essence, we divide our system according to the energy dissipation mechanisms into the two 
parts shown in Figure 2.6. We do this to emphasize the fact that the dissipation in the 
computing circuits is purely a function of the ramp shape, while the dissipation in the 
ramp generator depends on both the shape of the ramp and the devices used to construct 
the generator. If in the future we are able to invent a less dissipative generator, we can 
then calculate the minimum overall dissipation of the system relatively easily. In addition, 
treating the two separately will more clearly illustrate how one can trade more of one 
dissipation for less of the other. 

2.5 Sinusoidal Ramp Generator 

In this section we will examine the dissipation of a ramp generator that approximates 
the linear ramp with a sinusoid. The reason for this approximation is that it is easy to 
build energy efficient sinusoidal generators using inductors. 

We will first show how to build a non-dissipative sinusoidal load driver. We will
then calculate the energy dissipated in charging and discharging a capacitive load through
a conducting but slightly resistive device.

2.5.1 Circuit Dissipation for Sinusoidal Ramps 

As we have assumed so far, we model the CMOS circuit performing computation by
a resistor R in series with a load capacitor C. The model is based on a lumped element
approximation where the R and C are the effective R and C as seen from the supply
terminals of the chip. For appropriately sized circuits, i.e., circuits where all the branches
have the same time constant, we can assume that the gate voltages have roughly the same
rise time and therefore we can approximate R by $R_{device}/N$ and C by $C_{device} \times N$
for an N branch chip. Note that R and C depend only on N and the fabrication process technology.
We saw that when driven by a linear voltage ramp, as shown in Figure 2.7-a, this equivalent
circuit dissipates $CV_{dd}^2 (2RC)/T$. A sinusoidal ramp generator replaces the voltage source
with the inductor circuit as shown in Figure 2.7-b. To cycle a load capacitor starting at
0 volts through Vdd and back to 0, we

1. connect the RC circuit through the inductor to Vdd/2,

2. keep the inductor connected until the current reaches zero, signaling a complete
polarity inversion in the load capacitor,

3. disconnect the rail from the inductor and connect it to Vdd to compensate for
leakage and noise,

4. and finally reverse the above steps to return the load voltage to 0.

The inductor in the above circuit acts as an electrical "flywheel" that forces the
shuttling of energy between the capacitor and the Vdd/2 supply. For now, assume the switch
connecting the rail to the inductor is external to the chip. We will revisit this in a later
section.

Figure 2.7: Conventional and non-dissipative circuit analysis models.

We now examine the dissipation of our proposed circuit. To simplify the algebra, we
let $V_{dd} = +V_0$ and $V_{ss} = -V_0$ so that $V_0$ is equal to half the rail-to-rail
voltage. Our R, L, C circuit is described by

$$\frac{d^2 V_C}{dt^2} + \frac{R}{L}\frac{dV_C}{dt} + \frac{V_C}{LC} = 0 \qquad (2.13)$$

where $V_C$ is the capacitor voltage, $\alpha = R/2L$, and $\omega_0 = 1/\sqrt{LC}$. For the
solution to oscillate, the circuit must be underdamped, requiring that

$$2\sqrt{LC} > RC \qquad (2.14)$$

and we find that the frequency of oscillation, $\omega_d$, is given by

$$\omega_d = \sqrt{\omega_0^2 - \alpha^2} \qquad (2.15)$$
where $\omega_d$ is the frequency of operation and is equal to $\pi/T$. Since R and C are fixed
for a given chip, we can only adjust L to affect $\omega_d$. Examining the formula for
$\omega_d$ we discover that $\omega_d$ steadily increases as L decreases up to a maximum,
$\omega_{d,max}$, and then sharply decreases as the circuit approaches the critically damped
point. We find that

$$\omega_{d,max} = \frac{1}{RC} \qquad (2.16)$$

and the smallest inductance we would ever need, $L_{min}$, is found by

$$L_{min} = \frac{R^2 C}{2} = \frac{R_{device}^2 C_{device}}{2N} \qquad (2.17)$$

For $V_C(0) = V_0$ and $i_C(0) = 0$ we find

$$V_C(t) = V_0 e^{-\alpha t} \left( \cos\omega_d t + \frac{\alpha}{\omega_d} \sin\omega_d t \right) \qquad (2.18)$$

and

$$i_C(t) = -V_0 C \frac{\omega_0^2}{\omega_d} e^{-\alpha t} \sin\omega_d t \qquad (2.19)$$

Figure 2.8: Plot of $V_C(t, \omega_d)$ showing the effect of fast rise and fall times on energy loss.
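
The dependence of the oscillation frequency on L can be explored numerically. The sketch below assumes the underdamped series RLC relations alpha = R/2L and omega_0 = 1/sqrt(LC); the scan range, grid size, and component values are hypothetical choices.

```python
import math


def omega_d(r, l, c):
    """Damped oscillation frequency sqrt(omega_0^2 - alpha^2) of a series RLC circuit."""
    disc = 1.0 / (l * c) - (r / (2.0 * l)) ** 2
    if disc <= 0.0:
        raise ValueError("circuit is not underdamped for this L")
    return math.sqrt(disc)


def scan_for_peak(r, c, points=20001):
    """Scan L from just above critical damping (R^2*C/4) up to ten times that value."""
    l_crit = r**2 * c / 4.0
    best_l, best_w = None, 0.0
    for k in range(1, points):
        l = l_crit * (1.0 + 9.0 * k / points)
        w = omega_d(r, l, c)
        if w > best_w:
            best_l, best_w = l, w
    return best_l, best_w


R, C = 100.0, 1e-12  # hypothetical effective chip values
best_l, best_w = scan_for_peak(R, C)
print(f"L at peak / (R^2 C / 2) = {best_l / (R**2 * C / 2.0):.4f}")
print(f"peak omega_d * RC       = {best_w * R * C:.6f}")
```

For these assumed values the scan places the peak at L = R^2 C / 2 with omega_d = 1/(RC), which is what differentiating the expression for omega_d with respect to L gives analytically.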



Figure 2.8 shows a plot of normalized $V_C$ as a function of time and normalized $\omega_d$.
Initially $V_C(0) = 1$. With $\omega_d$ close to $\omega_{d,max}$ the voltage drops rapidly,
dissipating most of the capacitor energy on the way down. The reclaimed energy is insufficient
to fully charge the capacitor back to the negative rail. As $\omega_d$ moves away from
$\omega_{d,max}$ we in effect spread each oscillation over a longer period of time. We see from
the plot that the capacitor then retains most of its energy, since $V_C$ comes close to the
bottom rail. With $\omega_d$ only an order of magnitude lower than the maximum, the capacitor
recovers most of its energy.
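
The effect of slowing the swing can be illustrated by evaluating the capacitor voltage at the end of one half-oscillation. This is a minimal sketch, assuming the standard underdamped solution with V_C(0) = V_0 and zero initial current; the component values and the three inductances are hypothetical.

```python
import math


def rlc_params(r, l, c):
    """Return (alpha, omega_d) for an underdamped series RLC circuit."""
    alpha = r / (2.0 * l)
    return alpha, math.sqrt(1.0 / (l * c) - alpha**2)


def vc(t, v0, r, l, c):
    """Capacitor voltage for V_C(0) = v0 and zero initial current."""
    alpha, wd = rlc_params(r, l, c)
    return v0 * math.exp(-alpha * t) * (math.cos(wd * t) + (alpha / wd) * math.sin(wd * t))


def retained_energy_fraction(r, l, c):
    """Fraction of the capacitor energy left after one half-oscillation."""
    alpha, wd = rlc_params(r, l, c)
    return math.exp(-2.0 * math.pi * alpha / wd)


R, C = 100.0, 1e-12                  # hypothetical values
L_MIN = R**2 * C / 2.0               # inductance giving the fastest swing
for scale in (1.0, 100.0, 10000.0):  # larger L means a slower swing
    l = scale * L_MIN
    _, wd = rlc_params(R, l, C)
    print(f"L = {scale:7.0f} x L_min: V_C at end of swing = {vc(math.pi / wd, 1.0, R, l, C):+.4f}, "
          f"energy retained = {retained_energy_fraction(R, l, C):.4f}")
```

In this sketch the fastest swing recovers almost none of the stored energy, while the slower swings end close to the opposite rail.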

The energy dissipated in a single rail-to-rail swing for our gate consists of two
components. The first is the energy dissipated in R during the swing, $E_s$, which results in
a final voltage that is lower than $|V_0|$. Using Equation 2.18 we find that

$$E_s = \frac{1}{2} C V_0^2 \left( 1 - e^{-2\pi\alpha/\omega_d} \right) \qquad (2.20)$$

The second is the energy lost in R while charging C to the rail to compensate for the lower
peak voltage due to $E_s$. We will do this by simply connecting the line to the rail and losing
some energy in R, $E_c$, during this process. We find that

$$E_c = \frac{1}{2} C V_0^2 \left( 1 - e^{-\pi\alpha/\omega_d} \right)^2 \qquad (2.21)$$

Adding these two terms and multiplying by two to allow for both directions of swing, we
find that

$$E_{loss} = 2 C V_0^2 \left( 1 - e^{-\pi\alpha/\omega_d} \right) \qquad (2.22)$$

per period.
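
The bookkeeping of the two loss components can be verified with a few lines of code. The sketch below assumes that the swing ends at a voltage magnitude of V_0 exp(-pi alpha / omega_d) and that topping up to the rail through R costs half of C times the voltage gap squared; the variable x and the numeric values are illustrative.

```python
import math


def swing_losses(x, c, v0):
    """Return (E_s, E_c) for one rail-to-rail swing, with x = pi * alpha / omega_d."""
    e_s = 0.5 * c * v0**2 * (1.0 - math.exp(-2.0 * x))   # lost in R during the swing
    e_c = 0.5 * c * v0**2 * (1.0 - math.exp(-x)) ** 2    # lost while topping up to the rail
    return e_s, e_c


def loss_per_period(x, c, v0):
    """Total loss for both directions of swing."""
    e_s, e_c = swing_losses(x, c, v0)
    return 2.0 * (e_s + e_c)


C_LOAD, V0 = 1e-12, 2.5  # hypothetical values
for x in (0.01, 0.1, 1.0):
    print(x, loss_per_period(x, C_LOAD, V0), 2.0 * C_LOAD * V0**2 * (1.0 - math.exp(-x)))
```

The two printed energies agree for every x, confirming that the sum of the components collapses to the compact per-period expression.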

Using Equation 2.22, we compute the ratio of the energy consumption per period of
conventional CMOS to that of a sinusoidally driven gate, $F_{saving}$, as

$$F_{saving} = \frac{2}{1 - e^{-\pi\alpha/\omega_d}} \qquad (2.23)$$

Figure 2.9a shows a linear plot of $F_{saving}$ near $\omega_{d,max}$. In this region, the
circuit's performance is close to conventional CMOS but improves rapidly with lower
$\omega_d$. As we get away from $\omega_{d,max}$, the graph attains a nearly constant slope as
shown in Figure 2.9b and $\omega_0 \approx \omega_d$. Substituting R/2L for $\alpha$, and
$\pi/T$ for $\omega_d$ or $\omega_0$, we approximate $\pi\alpha/\omega_d$ with

$$\frac{\pi\alpha}{\omega_d} = \frac{\pi (R/2L)}{(\pi/T)} \approx \frac{TRC\omega_d^2}{2} = \frac{\pi^2 RC}{2T} \qquad (2.24)$$

Expanding the exponential, we get

$$F_{saving} \approx \frac{4T}{\pi^2 RC} \qquad (2.25)$$

for one charging and discharging cycle.
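
The quality of the approximation can be checked against the exact exponential form. The sketch below assumes the saving factor 2 / (1 - exp(-pi alpha / omega_d)) with omega_d = pi/T, and solves the underdamped frequency relation for the inductance that produces a given swing time; the helper names and component values are hypothetical.

```python
import math


def inductance_for_swing(r, c, t_swing):
    """Larger root L of omega_d^2 = 1/(LC) - R^2/(4L^2) for omega_d = pi / t_swing."""
    wd = math.pi / t_swing
    disc = 1.0 / c**2 - (wd * r) ** 2
    if disc < 0.0:
        raise ValueError("swing time too short: omega_d would exceed 1/(RC)")
    return (1.0 / c + math.sqrt(disc)) / (2.0 * wd**2)


def f_saving_exact(r, c, t_swing):
    """Saving factor 2 / (1 - exp(-pi*alpha/omega_d)); note pi*alpha/omega_d = alpha*T."""
    alpha = r / (2.0 * inductance_for_swing(r, c, t_swing))
    return 2.0 / -math.expm1(-alpha * t_swing)


def f_saving_approx(r, c, t_swing):
    """First-order approximation 4T / (pi^2 * R * C)."""
    return 4.0 * t_swing / (math.pi**2 * r * c)


R, C = 1e3, 1e-12  # hypothetical values, RC = 1 ns
for n in (10, 100, 1000):
    t = n * R * C
    print(f"T = {n:4d} RC: exact = {f_saving_exact(R, C, t):9.2f}, approx = {f_saving_approx(R, C, t):9.2f}")
```

For comparison, an ideal linear ramp dissipating C Vdd^2 (2RC)/T per cycle has a saving factor of T/(2RC), which exceeds the sinusoidal approximation by the constant factor pi^2/8.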

We see that sinusoidal ramps are worse than ideal linear ramps, Equation 2.12, by a 
factor of 7r 2 /8 owing to the sinusoidal, instead of the constant, nature of the current in the 
circuit. 
Figure 2.9: (a) Linear plot of $F_{saving}$ factor vs. $\omega_d$. (b) Log-Log plot of
$F_{saving}$ vs. $\omega_d$. In both plots $\omega_d$ is normalized to $\omega_{d,max}$.

2.5.2 Generator Dissipation for Sinusoidal Ramps

Examining Figure 2.7-b, we see that the only component that could dissipate energy in
a sinusoidal generator is the switch. This includes energy dissipated because of the switch's
non-zero on resistance as well as energy dissipated in the action of turning it on or off. The
case we will analyze is when the switch is made out of MOS devices. The resistance of an
N-Channel MOSFET, $R_{ON}$, is equal to

$$R_{ON} = \frac{V_{ds}}{I_d} = \frac{1}{\frac{W}{L} \bar{\mu}_n C_o \left[ (V_{gs} - V_{th}) - \frac{V_{ds}}{2} \right]} \qquad (2.26)$$

where W and L are the gate width and length of the device, $C_o$ is the gate capacitance
per area, and $\bar{\mu}_n$ is the average mobility of the carriers in the channel [35]. In
quasistatic switching, we always attempt to minimize $V_{ds}$ to reduce dissipation so that in
general $|V_{ds}| \ll |V_{gs} - V_{th}|$. With this in mind and substituting $C_g$, the gate
capacitance, for $WLC_o$, we get

$$R_{ON} = \frac{L^2}{\bar{\mu}_n C_g (V_{gs} - V_{th})} \qquad (2.27)$$

The on resistance of a MOSFET depends on the difference between the gate voltage and
the channel voltage. In a sinusoidal generator, this voltage varies with time as the load
charges. By using a pair of complementary devices for the switch we are able to minimize this
dependence, since the variation due to the N-Channel device cancels that due to the
P-Channel device. This makes $R_{ON}$ almost constant over the period of the ramp. $R_{ON}$ now
becomes

$$R_{ON} = \frac{L^2}{\mu_{ave} C_s (V_{dd} - V_{TH,ave})} = \frac{\tau_s}{C_s} \qquad (2.28)$$

where $\mu_{ave}$ is a weighted average of $\bar{\mu}_n$ and $\bar{\mu}_p$ and $V_{TH,ave}$ is
the weighted average of the threshold voltages of the N-Channel and P-Channel devices. The
energy dissipated in the switch during a single ramp swing has two components. The first is
the dissipation due to charging and discharging the gate of the switch and is equal to
$2C_s V_{dd}^2$ for a conventionally driven pass gate. The second is the dissipation due to
the finite on resistance of the switch. From Equation 2.25 and Equation 2.28 we see that the
total dissipation in the switch, $E_{switch}$, is equal to

$$E_{switch} = 2C_s V_{dd}^2 + \frac{\pi^2}{4} C_L^2 V_{dd}^2 \frac{\tau_s}{C_s T} \qquad (2.29)$$

where $C_L$ is the effective load capacitance of the computing circuit being driven.

Finding the minimum with respect to $C_s$, we find that the optimal gate capacitance of
the switch is

$$C_s = C_L \frac{\pi}{4} \sqrt{\frac{2\tau_s}{T}} \qquad (2.30)$$

and the minimum dissipation in the switch becomes

$$E_{switch} = C_L V_{dd}^2 \pi \sqrt{\frac{2\tau_s}{T}} \qquad (2.31)$$

The above analysis agrees with that reported by Koller and Athas [1], albeit with some
modification because of a differing definition of the process time constant of the switch,
$\tau_s$.

Adding the dissipation in the computing circuits, the total minimum dissipated energy
in the system, $E_{total}$, becomes

$$E_{total} = C_L V_{dd}^2 \left( \pi\sqrt{\frac{2\tau_s}{T}} + \frac{\pi^2}{4}\frac{\tau_c}{T} \right) \qquad (2.32)$$

where $\tau_c$ is the time constant for the circuit branches of the computing part of the
system. We see that after accounting for the switch dissipation, the overall energy
consumption of the system now drops as $1/\sqrt{T}$ instead of the thermodynamic limit of 1/T.
In general, $\tau_s \approx \tau_c$ when both the switch and the computing circuits use the
same process technology. Since T is always greater than $\tau_s$ and $\tau_c$ for proper
operation, we see that the switch dissipation always dominates the dissipation
characteristics of the total system.
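
The sizing trade-off for the switch can be sketched numerically. The code below assumes a switch energy of the form 2 C_s Vdd^2 + (pi^2/4) C_L^2 Vdd^2 tau_s / (C_s T), i.e., a gate-drive term that grows with C_s plus a resistive term that shrinks with C_s; this functional form, the function names, and all numeric values are assumptions made for illustration.

```python
import math


def switch_energy(c_s, c_l, vdd, tau_s, t):
    """Assumed switch dissipation: gate drive plus on-resistance loss (R_on = tau_s / c_s)."""
    gate = 2.0 * c_s * vdd**2
    resistive = (math.pi**2 / 4.0) * c_l**2 * vdd**2 * tau_s / (c_s * t)
    return gate + resistive


def optimal_switch_capacitance(c_l, tau_s, t):
    """Gate capacitance that minimizes switch_energy (zero of its derivative in c_s)."""
    return c_l * (math.pi / 4.0) * math.sqrt(2.0 * tau_s / t)


C_L, VDD, TAU_S = 100e-12, 5.0, 0.1e-9  # hypothetical load, supply, and switch time constant
T = 1e-6                                # hypothetical swing time
c_opt = optimal_switch_capacitance(C_L, TAU_S, T)
e_min = switch_energy(c_opt, C_L, VDD, TAU_S, T)
print(f"optimal C_s = {c_opt:.3e} F, minimum switch energy = {e_min:.3e} J")
print(f"conventional C_L*Vdd^2 = {C_L * VDD**2:.3e} J")
```

Under these assumptions the minimum equals C_L Vdd^2 pi sqrt(2 tau_s / T), so the switch contribution falls only as the inverse square root of T.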

2.5.3 Nested Sinusoidal Drivers

The above analysis assumed that the switch is driven conventionally and hence dissipates
$2C_s V_{dd}^2$ for every switching cycle. The factor of 2 is there because the switch has to
turn on and off for both the rising and the falling parts of the driven signal. Koller and
Athas [1] have suggested driving the switch with yet another sinusoidal ramp circuit which
itself is driven sinusoidally and so on. They report that for N nested drivers, the minimum
dissipation would follow $1/T^{(1 - 2^{-N})}$. They caution that since every driving stage has
to be faster than the stage it is driving, we quickly get to the point at which driving the
preceding stage sinusoidally adds overhead but saves no energy. Add to that the fact that
each switch must be smaller than the one it drives, which quickly limits the useful nesting
when the switch reaches the minimum size device of the technology. They report that with
current CMOS technologies, N would not exceed 2 or 3 unless T is very large, "on the order
of milliseconds or seconds".

2.5.4 Inductor Quality Factor

So far we have assumed that the inductor in our RLC circuit enjoyed an infinite quality
factor, Q. However, the Q's of commercially available inductors seldom reach higher than
100. The Q characterizes the fraction of energy that is dissipated by the inductor in an RLC
circuit during one cycle: the higher the Q, the smaller that fraction. Attempts to reduce the
energy dissipation of the system by more than a factor of Q will fail as the irreducible
inductor dissipation becomes the dominant factor. This means that the maximum attainable
energy saving factor is limited to about Q. We can dramatically improve this limit by using
high-temperature superconducting coils. These usually have Q's in excess of several thousand.
Unfortunately, such coils require cooling, typically by liquid nitrogen, which increases the
cost and decreases utility in some applications, e.g., portable equipment. The hope, however,
is the discovery of superconducting material that will work at, or slightly below, room
temperature. For a more detailed discussion of low temperature operation please see
Appendix A.

2.6 Stepwise Ramp Generator 

Inductors are not the only way to produce a gradual voltage ramp. Another way to
produce the gradual charge transfer from one potential to the other is to move the charge
one small voltage step at a time. This mimics the situation with the two heat baths at
differing temperatures separated by a number of heat baths at intermediate temperatures.
We start with a load, $C_L$, at zero volts that needs to be charged to Vdd. We provide N
voltage sources each with a voltage that is Vdd/N volts greater than the previous one.
We also provide a switch connected between the load and each voltage source as shown in
Figure 2.10.

To charge the load, we momentarily connect it to each voltage source, using the provided
switches, in a sequence from the source with the lowest voltage, Vdd/N for N voltage steps,
to the one with the highest value, Vdd. Reversing the switching sequence brings the load
back to zero. Advantages of using a stepwise generator include elimination of inductors
with their non-linear effects and more control over the rise and fall time of the ramp, since
this rise time is no longer related to the load capacitance.

Figure 2.11: Model of a CMOS computing circuit driven by a stepwise source.

Following the same organization used for the sinusoidal case, I will first examine the 
dissipation of a computing circuit when driven by a stepwise ramp. After that I will examine 
the dissipation in the stepwise ramp generator itself. For the analysis, I'll use a MOSFET 
implementation of the stepwise generator. 

2.6.1 Circuit Dissipation for Stepwise Ramps 

To account for only the dissipation in the computing circuit, we use the model shown
in Figure 2.11. The model consists of a load capacitor, $C_L$, representing the effective
capacitance of the computing circuit and a resistance, $R_{ON}$, representing the effective ON
resistances of the MOSFET's in the circuit paths to the loads. The driving voltage source
in the circuit generates a staircase waveform starting at zero volts and rising towards Vdd
with Vdd/N increments and with n RC time constants between the steps.

We begin by assuming that the time provided between each step is much longer than
the RC time constant of the circuit in Figure 2.11. This will ensure enough time between
steps for the circuit to settle. From Equation 2.2 we see that each step in the voltage source
dissipates $C_L \Delta V^2/2$ joules of energy where $\Delta V = V_{dd}/N$. Since there are N
steps during the rise and another N steps during the fall of the ramp voltage, the total
dissipation of a computing circuit that is driven by a stepwise ramp, $E_{stepwise}$, is

$$E_{stepwise} = 2N \cdot \frac{C_L}{2} \left( \frac{V_{dd}}{N} \right)^2 = \frac{C_L V_{dd}^2}{N} \qquad (2.33)$$

The above result should not come as a surprise. Given that the time between steps is
always fixed at the value that allows enough time for most of the transients to
settle, i.e., a few RC's, we see that increasing N is identical to increasing T. Said differently

$$N = \frac{T}{nRC} \qquad (2.34)$$

and we see that Equation 2.33 is very analogous to Equation 2.8.
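
The stepwise result can be reproduced by tracking the energy exchanged with each voltage source. The sketch below assumes ideal sources and full settling at every step; the function name and numeric values are hypothetical.

```python
def stepwise_cycle_dissipation(c_l, vdd, n):
    """Energy dissipated when a load is stepped from 0 to vdd and back in n steps each way.

    At every step the load is connected to an ideal source at the target
    voltage and allowed to settle completely. The dissipation is the energy
    exchanged with the source minus the change in energy stored on the load.
    """
    dv = vdd / n
    levels_up = [k * dv for k in range(1, n + 1)]
    levels_down = [k * dv for k in range(n - 1, -1, -1)]
    dissipated = 0.0
    v = 0.0
    for target in levels_up + levels_down:
        charge = c_l * (target - v)               # charge moved (negative on the way down)
        supplied = target * charge                # energy drawn from (or returned to) the source
        stored = 0.5 * c_l * (target**2 - v**2)   # change in capacitor energy
        dissipated += supplied - stored
        v = target
    return dissipated


C_L, VDD = 1e-12, 5.0  # hypothetical values
for n in (1, 2, 10, 100):
    print(n, stepwise_cycle_dissipation(C_L, VDD, n) / (C_L * VDD**2))
```

With a single step the sketch recovers the conventional C_L Vdd^2 per cycle, and the normalized dissipation falls as 1/N as steps are added.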

2.6.2 Generator Dissipation for Stepwise Ramps 

Svensson and Koller [38] have studied the dissipation in a MOSFET implementation of
a stepwise ramp generator. Their generator consisted of a number of large capacitors,
each charged to one of the step voltages and each separated from the output by an NMOS
switch. Shorting these large capacitors to the output via the MOSFET devices in ascending
order produced the stepwise ramp. In their work they show that the optimal number of
steps is

where m is the number of process time constants between steps and $\bar{\rho}$ is the weighted
average of ($R_{ON} \times C_{gate}$) of the MOSFET devices used. Given this, they calculate the
minimum energy dissipated in the control of these switches and due to their non-zero ON
resistance, $E_{opt}$, to be

Eo Vt = l^C L Vl D (2.36) 

We note that even though the dissipation in the computing circuit driven by a stepwise
generator followed 1/T, the energy dissipated in a MOSFET implementation of the stepwise
ramp generator optimally follows $1/\sqrt[3]{T}$. This means that the energy dissipation of the
overall system would track $1/\sqrt[3]{T}$. This is worse than the performance of a sinusoidal
generator using a MOSFET switch, which tracked $1/\sqrt{T}$.

2.7 Alternate Power Switches 

The above derivations assumed that the power switch, or switches in the stepwise case,
was built out of the same device technology as that of the computing circuit. In both the
sinusoidal ramp generator and the stepwise generator cases, the energy dissipation of the
computing circuit followed the theoretical line of 1/T. We saw how this impressive energy
saving behavior deteriorated, to $1/\sqrt{T}$ for the sinusoidal case and to $1/\sqrt[3]{T}$ in
the stepwise case, due to the non-ideal properties of the MOSFET's used in the generators.
However, the economics of the VLSI computing circuit of the system are very different from
those of the power supplies, or ramp generators. The technologies for both the computing
circuits and the power switches in the ramp generators need not, and as we will show should
not, be the same. In the coming analysis we will concentrate on the power switch for the
sinusoidal ramp generator. Even though the proposed methods could be equally applied to
the stepwise case, we will focus on the sinusoidal generator because of its more attractive
energy dissipation curve.

To illustrate the point of using alternate technologies in the ramp generators, let us
assume that $\tau_s < \tau_c$. We rearrange Equation 2.32 to find the value of normalized T,
$T/\tau_c$, at which the energy dissipation in the switch, $E_{switch}$, becomes equal to the
dissipation in the circuit, $E_{circuit}$. We find that

$$\frac{T}{\tau_c} = \frac{\pi^2}{32} \frac{\tau_c}{\tau_s} \qquad (2.37)$$

Note that the energy savings following 1/T start with $T/\tau_c$ increasing beyond $\pi^2/4$
as demonstrated by Equation 2.25. Before T increases beyond the point indicated by
Equation 2.37, $E_{circuit}$ dominates the dissipation and the consumption follows 1/T.
Once T increases past this point, $E_{switch}$ rapidly dominates the dissipation and the
consumption follows $1/\sqrt{T}$. From the above we see that for every order of magnitude that
$\tau_c$ is larger than $\tau_s$ we get an additional order of magnitude through which T can
increase while maintaining an energy saving factor that is linear with T. That is an additional
decade drop in operating frequency during which $E_{switch}$, with its inferior $1/\sqrt{T}$
dissipation factor, is still insignificant.
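
The crossover between circuit-dominated and switch-dominated dissipation can be located numerically. The sketch below assumes a minimum switch energy of C_L Vdd^2 pi sqrt(2 tau_s / T) and a circuit energy of (pi^2/4) C_L Vdd^2 tau_c / T; these two forms, the function names, and the time constants are assumptions made for illustration.

```python
import math


def e_switch_min(c_l, vdd, tau_s, t):
    """Assumed minimum switch dissipation, falling as 1/sqrt(T)."""
    return c_l * vdd**2 * math.pi * math.sqrt(2.0 * tau_s / t)


def e_circuit(c_l, vdd, tau_c, t):
    """Assumed computing-circuit dissipation under a sinusoidal ramp, falling as 1/T."""
    return (math.pi**2 / 4.0) * c_l * vdd**2 * tau_c / t


def crossover_time(tau_c, tau_s):
    """Swing time at which the two assumed dissipation terms are equal."""
    return math.pi**2 * tau_c**2 / (32.0 * tau_s)


C_L, VDD = 100e-12, 5.0        # hypothetical load and supply
TAU_C, TAU_S = 1e-9, 1e-9 / 31  # hypothetical time constants with tau_c / tau_s = 31
t_x = crossover_time(TAU_C, TAU_S)
print(f"crossover at T / tau_c = {t_x / TAU_C:.2f}")
for scale in (0.1, 1.0, 10.0):
    t = scale * t_x
    print(f"T = {scale:4.1f} x crossover: E_circuit / E_switch = "
          f"{e_circuit(C_L, VDD, TAU_C, t) / e_switch_min(C_L, VDD, TAU_S, t):.3f}")
```

Because the crossover time scales with the ratio tau_c / tau_s, a faster switch technology extends the range of T over which the total dissipation still follows 1/T.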

2.7.1 MODFET Switch 

To illustrate how useful the above concept could be, we examine the possibility of using
a Modulation-Doped FET device, MODFET, for the switch [7, 8]. For a 2 µm process
through MOSIS, the fabrication house reported a frequency of 35 MHz for a 31-stage ring
oscillator. Lee, Lee, Miller and Anderson [30] report a frequency of 1.36 GHz for a 25-stage
oscillator in 1983. Since the ring oscillator period is linearly dependent on the number of
stages and on the process time constant, we see that if we use a MODFET having the reported
parameters for the switch while using the relatively inexpensive 2 µm process in the
computing circuits, we can get a ratio of 31 for $\tau_c/\tau_s$. This means that for the first
31 fold increase in T, the total energy consumption of the system would follow the
thermodynamic limit of 1/T.
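
The ratio of the two process time constants follows from the reported ring oscillator figures. The sketch below assumes the usual ring oscillator relation in which one period equals twice the number of stages times the per-stage delay, and treats the per-stage delay as a stand-in for the process time constant; the function name is hypothetical.

```python
def stage_delay(n_stages, f_osc_hz):
    """Per-stage delay of a ring oscillator, assuming period = 2 * n_stages * delay."""
    return 1.0 / (2.0 * n_stages * f_osc_hz)


tau_c = stage_delay(31, 35e6)    # 2 um process: 31 stages at 35 MHz
tau_s = stage_delay(25, 1.36e9)  # MODFET: 25 stages at 1.36 GHz
ratio = tau_c / tau_s
print(f"tau_c = {tau_c * 1e12:.1f} ps, tau_s = {tau_s * 1e12:.1f} ps, ratio = {ratio:.1f}")
```

The constant factor in the period relation cancels in the ratio, so only the stage counts and frequencies matter.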

The above illustrates the payoff of mixing advanced technologies for the switch with
conventional low cost technologies for the computing part of the system. The suggestion to
use MODFET's was merely to serve as an example. Other devices with much lower time
constants exist, especially those with superconducting behavior. Such non-dissipative devices
could result in a total system dissipation that follows 1/T throughout the entire operating
frequency range. Unfortunately, the details of their implementation as well as the operating
limits of these superconducting devices are not familiar to this author.

2.7.2 Bipolar-MOSFET Switch 

So far we have concentrated on the use of field-effect transistors for the power switch in
the ramp generator. We favored FET's over other devices because of their superior static
properties of requiring no additional energy to keep them switched on, and of having a
nearly linear ON resistance irrespective of the current through them. We have seen how to
reduce the undesirable dissipation of these switches by employing alternate technologies,
such as MODFET's, that are highly suitable for the switch. Regardless of the technology
used in the fabrication of these devices, the overall dissipation will always track the curve
predicted in Equation 2.32. The restriction in Equation 2.32 followed from our assumption
that the gate of the switch is controlled conventionally. The attempt in Section 2.5.3 to
reduce the dissipation by dropping this assumption yielded little improvement. This comes
from the fact that an FET switch has a linear non-regenerative action relating its gate
potential to its $R_{ON}$ when $V_{ds}$ is small. Therefore, every nested stage must run more
than the minimum 2 times faster than the stage it is driving in order for the circuit to reduce
the overall energy dissipation. Our hope is to find a device with dynamic amplification
properties under low $V_{ds}$ so as to further reduce the dissipation associated with
controlling the switch.

A bipolar transistor has that property. The collector current in a bipolar transistor
depends exponentially on the base-emitter voltage, $V_{BE}$, and hence the transistor
resistance can change by orders of magnitude in response to a small change in the input
voltage, $V_{BE}$. Unfortunately, bipolar transistors have two properties that usually exclude
them from low power systems. The first is the constant supply of base current, and thus
constant dissipation, that is required to keep them turned on. The second is the non-linear
and almost constant potential drop, $V_{sat}$, across them while they are in saturation. In
other words, a bipolar transistor exhibits good dynamic properties when compared to a MOSFET,
but has undesirable static power dissipation. We therefore propose a hybrid switching device
that aims to make use of the superior dynamic behavior of the bipolar transistor and the much
desired static properties of a MOSFET. This hybrid switch consists of a bipolar device in
parallel with a MOSFET.

To yield an improvement over the single MOSFET scheme, we will control the MOSFET
switch by a gradual voltage ramp. The control current for the bipolar transistor, $I_B$, will
be set by a feedback control circuit such that the collector-emitter voltage, $V_{CE}$, is kept
slightly higher than the saturation voltage of the device. This will keep the voltage drop
across the transistor to a minimum while maintaining $h_{FE}$ at its nominal level. Low
$V_{CE}$ leads to least dissipation in the collector-emitter path, while high $h_{FE}$ leads to
minimum control dissipation by minimizing the required $I_B$ for a given $I_C$.

The switch action proceeds as follows. First the feedback circuit of the transistor is
turned on. At the same time we start charging the gate of the MOSFET by the voltage
ramp. Initially, the bipolar transistor would carry all the current. As the gate voltage of
the MOSFET rises, the voltage across the MOSFET will at some point drop below $V_{CE}$ of
the bipolar transistor. This is caused by the falling ON resistance of the MOSFET as it is
driven ON stronger and stronger. With $V_{ds}$ dropping below $V_{CE}$, the MOSFET will carry
most of the current and the dissipation through the collector-emitter path of the bipolar
transistor is eliminated. In addition, the feedback circuit controlling the bipolar transistor,
in an attempt to keep $V_{CE}$ at the programmed value, will continue to reduce the base
current of the bipolar transistor, quickly forcing it into cut-off. In our RLC circuit, the
current builds up from zero at the beginning of the switching cycle, rises to a maximum
level in the middle, and drops down to zero at the end of the cycle. In the hybrid switch
above, the bipolar transistor carries most of the current only at the beginning and tail end
of the cycle. The hand-off of current from the bipolar transistor to the MOSFET occurs
exactly at the time that the potential drop across the device carrying the current exceeds
the drop across the other device. The hand-off is made more abrupt by the feedback circuit.
What is important to note here is that the constant current, as governed by the inductor,
is automatically directed by the circuit to flow through the least dissipative component of
the hybrid switch. This happens while the gate of the MOSFET is controlled gradually to
minimize dissipation.

We can calculate the energy dissipated in a bipolar switch that conducts a charge Q as 

E_{bipolar} = Q \left( V_{CE} + \frac{V_{BE}}{h_{FE}} \right)    (2.38) 

For silicon, Vbe is a function of doping and intrinsic carrier concentration and is usually 
fixed at around 0.6V. Vce is lowest when the device is in saturation. For a transistor with 
equal doping concentrations in the emitter and collector regions, Vce can theoretically 
approach zero when the transistor is pushed into deep saturation, thus eliminating the 
dissipation due to collector-emitter voltage drop. Unfortunately, hFE becomes close to 
1 when the device is in deep saturation and the dissipation in the base-emitter junction 
dominates. Since hFE drops exponentially when approaching saturation, it becomes evident 
that minimum dissipation will occur when the device is operated closer to the edge of the 
active region rather than in deep saturation. Under such conditions, it is possible to lower 
Vce below Vbe while maintaining negligible base-emitter dissipation due to large hFE. 
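
The trade-off described above can be made concrete with a small numerical sketch of 
Equation 2.38. The charge and the two operating points below are hypothetical values 
chosen only for illustration; they are not measurements from this work. 

```python
def bipolar_switch_energy(q, v_ce, v_be, h_fe):
    """Energy dissipated by a bipolar switch conducting charge q (Equation 2.38)."""
    collector_loss = q * v_ce        # charge q drops through the collector-emitter voltage
    base_loss = q * v_be / h_fe      # base charge q / hFE drops through the base-emitter voltage
    return collector_loss + base_loss


Q = 1e-9  # conducted charge in coulombs (hypothetical)

# Deep saturation: Vce approaches zero, but hFE collapses toward 1.
deep_saturation = bipolar_switch_energy(Q, v_ce=0.0, v_be=0.6, h_fe=1.0)

# Edge of the active region: a small Vce remains, but hFE stays large.
edge_of_active = bipolar_switch_energy(Q, v_ce=0.2, v_be=0.6, h_fe=100.0)

print(deep_saturation, edge_of_active)
```

With these assumed numbers the edge-of-active operating point dissipates less than the 
deep-saturation point, because the base-emitter term is divided by a large hFE. 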

2.7.3 Micromechanical Switch 

For our purposes, the best switch is the one that has the lowest activation energy for 
a given energy transfer through it. That is, the ratio of the energy needed to control the 
action of the switch to the energy that it is capable of conducting is minimal. In addition, 
we hope that this switch has sharp turn-on and turn-off curves as a function of control 
energy. 

Electromechanical relays have what could be the lowest value for this ratio. This results 
from their very low ON resistance due to their metal contacts. The problem with regular 
relays is that they are slow. However, by soliciting the help of micromachining, it might 
be possible to make an electrostatic switch with the properties we are seeking. Those 
properties include speed, low activation energy dissipation, very sharp turn-on, and low ON 
resistance. Given this collection of desirable properties, it becomes important to investigate 
the feasibility of using micromechanical structures for our switching purpose. By using 
metalized contacts, such as aluminum, we are confident that the ON resistance of these 
switches is orders of magnitude lower than that of a power MOSFET. To yield improvement 
over MOSFETs, we have to make sure that the activation energy of these micromechanical 
switches does not exceed that of a MOSFET by the same orders of magnitude. We believe 
that this is the case. 

Driven electrostatically, a micromechanical relay is quite similar to a MOSFET. In the 
case of the MOSFET, charge is deposited on the gate of the device to turn it on. Likewise, 
for a micro switch, charge is delivered to an electrode, causing an electrostatic force to move 
the arm of the switch as shown in Figure 2.12. We know that if the MOSFET is driven 
conventionally, it will dissipate Cgate Vdd^2 energy for every toggle. Likewise, we anticipate 
that a micromechanical switch that is driven conventionally would dissipate Emicro, 

E_{micro} = C_p(x) \, V_{on}^2    (2.39) 

where Cp is the activation plate capacitance of the switch and Von is the electrostatic plate 
potential required to turn the switch on. 

Figure 2.12: Schematic diagram of a simple Electroquasistatic Micromechanical switch. 

Note that in the case of a micromechanical switch, the capacitance is a function of the 
lever's position. Naively, one can think that the activation voltage is the voltage required 
to move the lever from its rest position all the way to its contact position, i.e., Von. More 
accurately, one would continuously apply a bias voltage, Vbias, to the plate of the switch 
such that the lever is as close to the contact position as possible without actually making 
a contact. Since a constant bias does not dissipate any energy, we see that the minimum 
activation dissipation for a micro switch is more accurately described by 

E_{micro} = C_p(x = contact) \, (V_{on} - V_{bias})^2    (2.40) 

The dissipation indicated by Equation 2.40 above is much lower than that predicted by 
Equation 2.39. It is here that the abrupt turn-on and turn-off properties of a 
micromechanical switch come into play. This is demonstrated by the fact that we are unable to do 
the same biasing for a MOSFET since the ON resistance of the device drops linearly with 
increasing gate voltage for a low Vds. 
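
As a rough illustration of how much the bias helps, the sketch below evaluates 
Equations 2.39 and 2.40 side by side. The capacitance and voltages are hypothetical, and 
the position-dependent plate capacitance is simplified to a single value at contact; both 
are assumptions made only for this example. 

```python
def conventional_activation_energy(c_p, v_on):
    """Equation 2.39: plate swung from zero all the way to Von."""
    return c_p * v_on ** 2


def biased_activation_energy(c_p_contact, v_on, v_bias):
    """Equation 2.40: plate held at Vbias, only the remaining swing is paid for."""
    return c_p_contact * (v_on - v_bias) ** 2


C_P = 1e-12     # plate capacitance at contact, in farads (hypothetical)
V_ON = 10.0     # turn-on potential, in volts (hypothetical)
V_BIAS = 9.0    # standing bias just short of contact, in volts (hypothetical)

unbiased = conventional_activation_energy(C_P, V_ON)
biased = biased_activation_energy(C_P, V_ON, V_BIAS)
print(unbiased / biased)
```

Because the dissipation scales with the square of the voltage swing, biasing the plate to 
within 10% of Von reduces the activation dissipation by about two orders of magnitude 
in this example. 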

Ideally, Vbias will equal Von. This, however, provides none of the contact force that is 
necessary for reliable current flow. We therefore suspect that under ideal conditions, 
(Von - Vbias) will predominantly be a function of contact force and process evenness. 

As desirable as abrupt switches are for low dissipation, they could be troublesome 
in inductive circuits. In these circuits, the gradual increase in the ON resistance of the 
switching device provides a place into which the inductive currents in the circuit could 
dissipate. Without it, there is a potential for destructive arcing and current crowding 
to occur. Fortunately, the switch in a sinusoidal ramp generator is always timed such 
that it makes or breaks at precisely the moment when the current in the circuit is zero. 
Furthermore, to make the timing less critical, we propose to parallel this micromechanical 
switch with a very weak MOSFET. We time the MOSFET so that it turns off only after 
the micro switch breaks. This, in addition to the nearly zero current during switching off, 
will greatly reduce the undesirable inductive effects. For the least dissipation to occur, we 
turn this MOSFET ON anytime after the micro switch makes. 

2.8 Zero Energy Computing and Reversibility 

The discussion throughout this section has focused on gradually charging and 
discharging the nodes of our computing circuit in order to reduce energy dissipation. We have 
seen that in the ideal case, the dissipation in the computing circuit follows RC/T. At 
this point we might be led to believe that achieving asymptotically zero energy computing 
depends only on our ability to build an "ideal" ramp generator. As we shall see, this is not 
entirely correct. 

In conventional CMOS, the output node is forced to the logic level representing the 
result of an operation irrespective of the previous value of the node. In the case that the 
new output value differs from the old one, the circuit will dissipate CVdd^2/2. As long as 
we are willing to dissipate this amount of energy, the previous value of the node is of little 
importance. Quasistatic, low energy switching, however, depends on the gradual charging 
and discharging of the nodes. At the end of a computing step, the output node must be 
set to a logic value that reflects the outcome of an operation. Depending on the previous 
value of the node and the new forced value we get four cases. They are: a 1 going to a 
1, a 0 going to a 0, a 1 going to a 0, and a 0 going to a 1. In a quasistatic computer we 
connect the nodes representing the first two cases to the corresponding Vdd and ground. 
For the third case, we connect the node to a rail that is gradually swinging from 1 to 0, 
and for the last case we connect the node to a rail that is gradually swinging from 0 to 1. 
We rely on the old value of the node when determining what to connect it to in order to 
quasistatically set it to the new value. In other words, to quasistatically switch a node, we 
need to know its previous value before and throughout the gradual swing of the altering rail, 
and herein lies the problem. As the rail starts to swing, it is in effect destroying the piece 
of information it needs throughout its transition. Naively providing temporary storage for 
this bit of information relocates the problem to the time of reusing this temporary storage. 
Without this knowledge it is possible to accidentally short a node to a rail at a different 
potential, leading to an RC governed discharge time and the familiar CVdd^2/2 dissipation. 
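
The four cases can be summarized in a short sketch. The function and rail names below 
are hypothetical labels introduced only for this illustration. The point of the sketch is 
that the rail choice takes the old value of the node as an argument, which is exactly the 
information that the swing itself overwrites. 

```python
def select_rail(old_value, new_value):
    """Pick the rail that moves a node quasistatically from old_value to new_value."""
    if old_value == new_value:
        # Cases 1 and 2: the node does not change, so it is tied to a constant supply.
        return "VDD" if new_value == 1 else "GND"
    # Cases 3 and 4: the node must follow a gradually swinging rail.
    return "FALLING_RAMP" if old_value == 1 else "RISING_RAMP"


def dissipation_if_shorted(c, v_dd, node_value, rail_value):
    """Energy lost when a node is abruptly connected to a constant rail."""
    if node_value == rail_value:
        return 0.0
    return 0.5 * c * v_dd ** 2   # the familiar C * Vdd^2 / 2 of conventional switching


for old in (0, 1):
    for new in (0, 1):
        print(old, "->", new, ":", select_rail(old, new))
```

If the old value is unknown, `select_rail` cannot be evaluated, and connecting the node to 
a guessed rail risks the `dissipation_if_shorted` penalty. 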

From the above, we see that building non-dissipative ramp generators is only part of 
the solution in reaching asymptotically zero energy computing. The remaining bulk of this 
thesis will illustrate how to solve the other part of the problem through the use of Reversible 
Logic. 

2.9 Summary 

In this section we have shown that it is possible to shuttle charge through resistive 
media while dissipating vanishingly small amounts of energy by using slowly rising voltage 
ramps. We have examined a number of methods to produce these voltage ramps and 
analyzed the amount of energy that is dissipated in each method. Finally, we have indicated 
that quasistatic charge transfer alone is not sufficient to result in asymptotically zero energy 
computing. In the coming sections we will see how non-dissipative computing invariably 
leads to reversible computing and that thermodynamic entropy and information entropy 
are strongly linked. 

3. Early Implementations of Charge Recovery Logic 



3.1 Introduction 

In this section we will describe a number of possible implementations of charge recovery 
logic circuits. The common property of all of these circuits is their ability to do computation 
quasistatically thus consuming arbitrarily small amounts of energy when clocked sufficiently 
slowly. That is, their energy consumption drops linearly with frequency. Consequently, their 
power consumption drops quadratically with frequency. 
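
The scaling claim can be checked with a minimal numerical sketch. It assumes the usual 
quasistatic approximation in which the energy dissipated per transition is (RC/T) C Vdd^2 
for a ramp time T much longer than RC, and it further assumes T = 1/f. The component 
values are hypothetical. 

```python
import math


def energy_per_cycle(r, c, v_dd, f):
    """Quasistatic dissipation per cycle, assuming ramp time T = 1/f and T >> RC."""
    t = 1.0 / f
    return (r * c / t) * c * v_dd ** 2


def average_power(r, c, v_dd, f):
    """Average power is the energy per cycle times the number of cycles per second."""
    return energy_per_cycle(r, c, v_dd, f) * f


R, C, VDD = 1e3, 1e-12, 5.0   # ohms, farads, volts (hypothetical values)

energy_ratio = energy_per_cycle(R, C, VDD, 1e6) / energy_per_cycle(R, C, VDD, 5e5)
power_ratio = average_power(R, C, VDD, 1e6) / average_power(R, C, VDD, 5e5)
print(energy_ratio, power_ratio)
```

Halving the clock frequency halves the energy per cycle and cuts the average power to one 
quarter, which is the linear and quadratic behavior stated above. 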

3.2 Fully-Symmetric CRL Implementation 

This section describes our first implementation of charge recovery logic. Even though 
we think that the advantages of later implementations make this one unfavorable, we 
nevertheless include it in the hope that some of the ideas in it find a place in future 
developments. 

3.2.1 Fully-Symmetric CRL Gate 

In describing our CRL gate we start with a conventional CMOS gate and gradually 
modify it to produce a gate with all the needed properties. We use the implementation 
of a NAND gate as an example. Figure 3.1 shows a first attempt at a CRL gate next to 
a conventional CMOS NAND gate. Here we discard the pull up part of the CMOS gate 
and replace each N-Channel device with a CMOS pass gate. In addition, we replace the 
ground rail with a swinging rail. Initially, the rail and all the nodes of the gate are at logic 
1, Vdd. After the inputs settle, the rail swings from a logic 1 to a logic 0. The output F 
now represents the value A ∧ B. The output of this gate can now be used by subsequent 
stages. Since we use CMOS pass gates for all our switches, we need to have the complement 
of the output available as well. To generate the complement, we duplicate our circuit but 
connect it to a rail which swings in the opposite direction, as shown in Figure 3.2. We call 
the rail that rests at Vdd and swings toward ground the bottom rail and the one that rests 
at ground and swings toward Vdd the top rail. 

Figure 3.1: Conventional NAND gate and an early attempt at a CRL NAND gate. 

Figure 3.2: Modified gate with dual-rail added. 

For reasons that will become evident later, we require that when this circuit is in its 
reset state, i.e., when the rails are in their rest state, the outputs are at a level that turns 
off any pass gates they control in a subsequent stage. For this reason we see that the 
outputs controlled by the top rail can only drive inputs of N-Channel devices. Similarly, 
the outputs controlled by the bottom rail can only drive inputs of P-Channel devices. We 
acknowledge this fact by labeling the outputs with the subscripts N and P. Unfortunately, 
the gate we have built so far is not universal as we cannot perform negation. This is 
because a logic 1, TRUE, in this circuit is no longer represented by a voltage level but by 
the event of the output actually swinging. A quick investigation shows that a swing can 
only force a subsequent swing. To produce the complement of a swing we borrow an idea 
from conventional CMOS and augment each half of our gate with a complementary network 
to produce the universal gate shown in Figure 3.3. Note that so far, we need four times as 
many devices as a conventional CMOS gate. 

Figure 3.3: The universal CRL NAND gate. 

To reduce confusion, we will no longer explicitly show the N-Channel and the P-Channel 
devices of the pass gates. In addition, we will assume that the single line labeled F 
represents the two-line pair composed of F_P and F_N and that the single line labeled 
\bar{F} represents both \bar{F}_P and \bar{F}_N. Further, we note that a pass gate 
receiving the output F is ON when F is at a logic 1. Before we redraw our gate, we will 
add some functionality that will be needed when we connect copies of this gate in a 
non-dissipative network. In our gate so far, we identify all the inputs as SET inputs and 
label them with the subscript S. We then add a second pass gate in parallel with each pass 
gate already in the circuit. We identify the inputs of these new pass gates as the RESET 
inputs and label them with the subscript R. We show the completed CRL NAND gate with 
all modifications in Figure 3.4. 

It should be obvious how we can build any logical function based on the above techniques 
in a way similar to conventional CMOS, except for the additional redundancy. 

At this point we note that in this implementation of CRL, there is always a pair of 
output wires that do not swing during the SETting and RESETting of the gate. Those 
wires rely on their node capacitance to maintain their voltage. Unfortunately, it is possible 
under some inputs for the same output wire pair to continue to be the non-swinging pair 
for a long time. Since non-swinging wires are always floating, it is possible after some time 
for the voltage on these nodes to wander due to leakage or capacitive coupling. To prevent 
this from happening, we add a number of transistors that clamp these floating nodes to 
the correct supply rail as shown in Figure 3.5. With these additional transistors, we see 
that all the output wires are actively driven at the end of the SETting swing and therefore 
provide for periodic refreshing of the voltages on the floating nodes. 

Figure 3.4: Completed CRL NAND gate with pipelining support. 

So far, our CRL gate asymptotically requires eight times as many devices as conventional 
CMOS as illustrated by the 2-input AND/NAND gate example. For circuits that require 
complementary outputs, such as address decoders, the redundancy factor may be somewhat 
less. 

3.2.2 Reversible Pipeline of Fully-Symmetric CRL 

In this section we show how to connect CRL gates, or stages, in a non-dissipative 
pipeline. The main purpose of this method of interconnection is to provide the RESET 
inputs to each gate at the correct time. We build the pipeline out of copies of an abstraction 
box shown in Figure 3.6a. 

We think of this box as containing a parallel set of CRL gates performing any logical 
function of an arbitrary number of inputs. Symbolically, the output of the box represents a 
bundle containing the outputs of the CRL gates internal to the box. The box has two input 
branches. One is the SET branch and identifies a bundle containing all the SET inputs 
of the gates internal to the box, the other is the RESET branch and identifies a bundle 
containing all the RESET inputs. The function computed by the box is identified by the 
letter in the center of the box. Finally, indicated at the lower corner of the box is the clock 
phase used to control all the CRL gates internal to that box. A clock of φ1 indicates that 
the top rail is connected to φ1 and the bottom rail to /φ1, while a clock of /φ1 indicates 
that the top rail is connected to /φ1 and the bottom rail to φ1. 

Figure 3.5: First CRL implementation with output cross-coupled clamps. 

Figure 3.6: (a) CRL abstraction box. (b) Timing of the four clock rails. 

Using this abstraction, Figure 3.7 illustrates how CRL gates are connected to produce 
a non-dissipative pipeline. The timing of the four clock lines is shown in Figure 3.6b. Note 
that the box with a function F^{-1} performs the inverse operation of the box with a function 
F. To SET a box, all the SET inputs must be valid and stable and all the RESET inputs 
must be idle, i.e., they come from a box that is currently in RESET, so that all the RESET 
pass gates are OFF. With these inputs, swinging the clock rails of the box away from their 
rest level will SET the box. To RESET the box, the rails are returned to their rest levels 
while the SET inputs are idle and the RESET inputs are active and stable. 

Figure 3.7: Non-dissipative multi-stage pipeline connection. 

To follow the operation of the circuit we start with φ1 and φ2 at their rest state and 
assume that the pipeline has been operating for some time. We follow the propagation of the 
input a_0 only, even though other parallel activity is going on. From the states of the clocks 
we see that box F_1 is RESET and its RESET inputs are idle as well. Swinging φ1 SETs F_1 
and computes F_1(a_0). Swinging φ2 now SETs F_2 and F_1^{-1} and produces F_2(F_1(a_0)) and 
F_1^{-1}(F_1(a_0)) = a_0 respectively. Now swinging φ1 to its rest level RESETs F_1 and produces 
F_3(F_2(F_1(a_0))) and F_2^{-1}(F_2(F_1(a_0))) = F_1(a_0). The circuit is now ready to safely RESET 
boxes F_2 and F_1^{-1}. One can see that we can continuously drive a new input into the network 
every φ1 cycle and successfully operate the pipeline in a non-dissipative fashion. In addition, 
this pipeline can have any arbitrary number of stages and still be driven entirely by φ1, φ2 and 
their inverses only. 
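
The role of the inverse boxes can be illustrated with a purely logical sketch that ignores 
clock timing and circuit detail. The three stage functions below are arbitrary invertible 
maps on 4-bit values, chosen only for this example; they are not the functions of any 
circuit in this work. For each stage, the inverse box applied to the stage output recovers 
the stage input, which is the value the RESET inputs of that stage require. 

```python
# Each entry pairs a forward function with its inverse (hypothetical 4-bit examples).
STAGES = [
    (lambda x: (x + 3) % 16, lambda y: (y - 3) % 16),
    (lambda x: x ^ 0b0101, lambda y: y ^ 0b0101),
    (lambda x: (x * 3) % 16, lambda y: (y * 11) % 16),   # 3 * 11 = 33 = 1 (mod 16)
]


def trace_pipeline(a0):
    """Return (stage input, stage output, value produced by the inverse box) per stage."""
    records = []
    value = a0
    for forward, inverse in STAGES:
        output = forward(value)
        reset_input = inverse(output)   # what the inverse box feeds back for RESET
        records.append((value, output, reset_input))
        value = output
    return records


for stage_input, stage_output, reset_input in trace_pipeline(5):
    print(stage_input, stage_output, reset_input)
```

In every record the third entry equals the first, mirroring the identities 
F_1^{-1}(F_1(a_0)) = a_0 and F_2^{-1}(F_2(F_1(a_0))) = F_1(a_0) in the walk-through above. 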

There remains one problem however. At the end of the pipeline the RESET input to 
F_5^{-1} is not available and hence resetting this box is dissipative. Furthermore, it could not 
be generated, as this is the place where reversibility is broken. We can however, restore 
reversibility here through brute force by connecting to the end of this pipeline a mirror, and 
an inverse, image of itself. The missing input at the end of this extended pipeline that is 
needed to reverse the last inverse box is now simply ao. With this topology, we can proceed 
without any dissipation by continually supplying delayed copies of the input to the pipeline 
at the inverse input on the far right. The technique of connecting an inverse network to 
the forward network was previously used in [13] and [11] to eliminate dissipation through 
recycling the intermediate garbage that results in conservative logic. 

Admittedly, the above solution is more of theoretical than practical interest. If 
reversibility needs to be broken, that is, when information loss cannot be avoided, then some 
dissipation will occur for every lost bit of information. For these situations, we can reduce 
the dissipation by ending the pipeline with two identity boxes, I(a) = a, and using the 
output of the lower identity box to reset itself as shown in Figure 3.8. Closer examination 
shows that the dissipation is CVt^2/2 per bit per cycle as opposed to CVdd^2/2 for conventional 
gates. Since the output of an identity box is the same as the input, the resetting swing 
proceeds normally until the output levels are insufficient to keep the appropriate pass gates 
on. Because of this, some internal nodes will have a potential that is one Vt away from 
their reset levels. The next input to the gate will short this potential difference, resulting 
in CVt^2/2 dissipation per bit. Note that we only pay this penalty at the last stage of a 
long pipeline. 

Figure 3.8: Last pipeline stage connection for dissipation reduction. 

3.3 N-Channel CRL 

The circuit described in the previous section requires 8 times as many devices as 
conventional CMOS circuits. The following is a description of alternate implementations that 
require only 2-4 times as many devices. The tradeoff is that the following circuits require 
more swinging rails (6-8 rails) than the previous implementation (4 rails). 

3.3.1 N-Channel CRL Gate 

The main idea behind these new circuits is that they have two sections. A front-end 
section that has only N-Channel devices, and a back-end section that uses pass gates made 
out of N-Channel and P-Channel devices. The new circuits require two different rails that 
are not the complement of one another. The first rail, we call the slow rail, controls the 
front-end section of the circuit. The second, we call the fast rail, controls the back-end of 
the circuit. Figure 3.9 shows an implementation of a NAND gate using this new technique. 

In the front-end section, all SET N-Channel devices, identified by the letter "S" in the 
figure, are paralleled by RESET N-Channel devices, identified by the letter "R". The pass 
gates of the back-end section, identified by the letter "P", do not have RESET pass gates. 
While in the RESET state, all the outputs and internal nodes of the front-end section are 
at Vss. The slow rail is at Vss as well. In addition, all the outputs and internal nodes of 
the back-end section are at Vss. The fast rail would be at Vss as well. We assume that 
all the SET and RESET inputs are at Vss. The circuit is now ready to accept new input 
on its SET lines. After the SET inputs become valid and stable, we gradually swing the 
slow rail from Vss to Vdd. At the end of the swing, some outputs of the front-end section 
would remain at Vss while some would swing to (Vdd - Vt) depending on the input value 
and the implemented function. Note that we generate the true and complement of every 
signal in the front-end section so that we could drive both sides of the pass gates in the 
back-end section. One of the main purposes of the back-end is to regenerate the rail-to-rail 
logic levels at the output of the gate that could not be generated by the N-Channel only 
section. After the front-end SETs, we swing the fast rail from Vss to Vdd. Depending on 
the computed result, we could have the pass gates of the back-end set ON or OFF. Those 
that are set on by the front-end, having their P-Channel side at Vss and their N-Channel 
side at (Vdd - Vt), will swing the outputs they are driving to Vdd. The pass gates that are 
off would have the gates of their P-Channel devices at (Vdd - Vt) and the gates of their 
N-Channel devices at Vss. The N-Channel side that is off will remain off during the entire 
swing of the fast rail, while the P-Channel side could start conducting just before the end of 
the swing. However, assuming that the threshold voltages of the P-Channel and the N-Channel 
devices are roughly equal, except for the sign, the channel-to-gate capacitive coupling of 
the P-Channel devices will raise the voltage at their gate during the low to high swing of 
the fast rail, thus ensuring that they will remain off. Note that the bootstrapping here is 
used to shut off devices and not to recover the Vt drop of the front-end. For this reason, a 
minimal amount of coupling would still result in keeping the device off and would lead to 
proper operation. Note that the outputs of the back-end are now rail-to-rail and could drive 
the front-end of a subsequent stage without Vt degradation. 

Figure 3.9: Schematic diagram of a 2-Input AND/NAND N-Channel CRL gate. 

To RESET the circuit, we wait until the RESET inputs are active and until the SET 
inputs go idle. First we swing the fast rail back to Vss and then reset the circuit by returning 
the slow rail to Vss. We need to reset the fast rail before the slow rail because the back-end 
does not have any RESET devices in parallel with the SET devices. The fact that the fast 
rail must SET and RESET itself while the slow rail is at the SET level forces it to have 
a much narrower duty cycle and hence the name fast rail. Because of the different duty 
cycles, we now need a total of eight rails, instead of four, to run our circuits. The timing of 
these fast rails relative to the slow rail is shown in Figure 3.10. 
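
The ordering constraint on the two rails can be written down as a small checker. This is 
only a sketch for a gate that rests at Vss: rail levels are abstracted to 0 (Vss) and 1 
(Vdd), the gradual ramps are collapsed into single events, and the function name is a 
hypothetical label introduced here. 

```python
def is_valid_rail_sequence(events):
    """Check that the fast rail only moves while the slow rail is at its SET level.

    events is a list of (rail, level) pairs, with rail in {"slow", "fast"} and
    level 0 for Vss or 1 for Vdd. Both rails are assumed to start at Vss.
    """
    levels = {"slow": 0, "fast": 0}
    for rail, level in events:
        if rail == "fast" and levels["slow"] != 1:
            return False   # back-end moved before the front-end SET or after it RESET
        if rail == "slow" and levels["fast"] != 0:
            return False   # front-end moved while the back-end was still SET
        levels[rail] = level
    return levels == {"slow": 0, "fast": 0}   # the cycle must end in the RESET state


good_cycle = [("slow", 1), ("fast", 1), ("fast", 0), ("slow", 0)]
bad_cycle = [("slow", 1), ("fast", 1), ("slow", 0), ("fast", 0)]
print(is_valid_rail_sequence(good_cycle), is_valid_rail_sequence(bad_cycle))
```

The valid cycle nests the fast rail's SET and RESET inside the interval during which the 
slow rail is at its SET level, which is why the fast rail has the narrower duty cycle. 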

In addition to the devices that are used for computing, the new circuit has N-Channel 
cross coupled pairs that are tied to the outputs of front and back sections of the circuit. 
These devices are used to hold the voltage of the node that does not swing to the rest rail 
and hence to maintain proper operation in the presence of dark currents. 

Figure 3.10: Timing of the fast rails in N-Channel CRL. 

Figure 3.11: Schematic diagram of an N-Channel CRL gate that rests at Vss. 

So far we have assumed that the front-end section RESETs to Vss and SETs to Vdd. An 
example of this is the N-Channel CRL buffer shown in Figure 3.11. This buffer is SET 
by the slow rail going from Vss to Vdd. The slow rail rest level is at Vss. In the diagram, 
C1 and C2 are clamps that are necessary to prevent the internal nodes of the first stage, 
nodes (1) and (2), from wandering when not actively driven because of leakage or noise. In 
addition, C3 and C4 are clamps that prevent the output nodes from wandering. 

By modifying the cross coupled devices, we can have it so that the front-end SETs to 
Vss and RESETs to Vdd. This modification is shown in Figure 3.12. Here, the gate is SET 
by the slow rail going from Vdd to Vss and is reset by the slow rail returning to Vdd. Having 
done that, we could drive the gates that were supposed to be driven by the complement of 
a slow clock by the true clock itself. This eliminates the need for the complements of the 
slow clocks, which reduces the needed swinging rails from 8 to 6. We note that unlike the 
previous buffer, no output clamps are needed. This is true since in the RESET state, the 
voltage on nodes (1) and (2) is equal to (Vdd - Vt). This voltage is enough to turn on 
the N-Channel devices of P1 and P2 and hence provide for periodic refreshing of the correct 
voltage levels at the output. 

Figure 3.12: Schematic diagram of an N-Channel CRL gate that rests at Vdd. 

Figure 3.13: N-Channel CRL gate with latch-up protection devices. 

There remains one problem. N-Channel CRL gates rely for their proper operation on 
the capacitive coupling of the fast rail through the output stage devices to the internal 
nodes (1) and (2) to boost the voltage on one of these nodes to Vdd. Unfortunately, it is 
possible for the voltage on these internal nodes to be boosted beyond Vdd. This 
could trigger latchup. We prevent this by adding clamping devices C5 and C6 as shown 
in Figure 3.13. C5 and C6 ensure that the voltage on the internal nodes never exceeds 
Vdd by more than a threshold voltage. Through proper device sizing, we can control the 
bootstrapping action so that the boosted voltage will not appreciably exceed Vdd and hence 
reduce the need for the dissipative action of C5 and C6. We stress here that the addition of 
C5 and C6 is purely for safety and that they are completely removed from the dynamics 
of charge movements in the gate. 

3.3.2 N-Channel CRL Reversible Pipeline 

Multiple stage pipelines of the gate described above are built using the same 
connecting topology that was used for the Fully-Symmetric CRL circuits described in the 
previous section and illustrated in Figure 3.7. The timing of the rails is similar to that of 
Fully-Symmetric CRL except for the inclusion of the fast rails. 

In addition to the reduction of the number of devices needed, the above circuit has the 
added advantage of simplifying reversibility. We can stack the pass gates of the back-end to 
do computation just as in the front-end. Since it is the function of the whole gate, consisting 
of the two sections, that must be reversible, and not the function of the subsections, we can 
embed non-reversible functions in the front-end section of our gate and not worry about it 
so long as the function of the whole gate is reversible. For example, we build an adder gate 
that is easily reversible from AND and OR gates that are not easily reversible but hidden 
within the bigger reversible adder block. 
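
The idea of hiding non-reversible primitives inside a reversible block can be sketched at 
the logic level. The particular mapping below, which passes the operands a and b through 
to the output alongside the sum and carry, is an assumption made for this illustration and 
is not taken from the gate designs in this work. 

```python
from itertools import product


def and_gate(a, b):
    """A non-reversible primitive: three different inputs map to the same output."""
    return a & b


def full_adder(a, b, cin):
    """Forward block built from AND, OR and XOR, keeping a and b at the output."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return (a, b, s, cout)


def full_adder_inverse(a, b, s, cout):
    """Inverse block: recover the carry input from the forward outputs."""
    cin = s ^ a ^ b
    return (a, b, cin)


and_outputs = [and_gate(a, b) for a, b in product((0, 1), repeat=2)]
adder_outputs = [full_adder(*bits) for bits in product((0, 1), repeat=3)]

print(len(set(and_outputs)), "distinct AND outputs for 4 inputs")
print(len(set(adder_outputs)), "distinct adder outputs for 8 inputs")
```

The AND gate collapses four input combinations onto two outputs, so it cannot be undone 
on its own, while the adder block as a whole maps its eight inputs to eight distinct 
outputs and can therefore be inverted. 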

3.4 Dynamic Considerations and Nonlinearities 

The above analysis assumes that we can, in theory, lump the gate capacitances of MOS 
devices into one equivalent linear capacitance. In practice however, we need to be more 
careful. Each rail in CRL feeds a number of branches. Ideally, the effective RC time 
constant of each branch as seen by the rail is data independent and equal to the RC time 
constant of the entire rail circuit. A branch with a longer time constant would lag behind 
during the transition. This would create a potential difference across pass gates that are 
switched on, leading to dissipation. The effect is minimized by the symmetry of CRL. 
Because of the existence of the true and complement networks for every output line, a 
swinging rail is always connected to one and only one output line. Therefore, regardless of 
the output level, the rail will always drive the same output capacitance. The only difference 
in RC comes from the fact that the true and complement networks are not identical and 
as such could contribute different RC's depending on the data. Properly sizing the devices 
so that the two networks exhibit the same time constant independent of the state of the 
inputs will eliminate this problem. This is possible for both Fully-Symmetric CRL and 
N-Channel CRL since both have dual polarity outputs, and hence it becomes possible 
for the rail to see the same effective capacitance irrespective of the computation results. 

Another point of consideration is the nonlinearity of the capacitances of MOS devices. 
An enhancement mode MOS device has a higher gate capacitance while in inversion, i.e., 
conducting, than when it is off. At the beginning of a SETting swing, all the outputs are 
at idle and all the devices driven by these outputs are off. At the completion of a SET, 
devices controlled by a swinging output are on. For this reason, a rail that is SETting a 
gate sees lower effective capacitance at the start of the swing than at the end. With the 
inductor anchored to Vdd/2, the rail will not reach the opposite voltage at the end of a 
swing. This leads to dissipation when the rail is connected to Vdd or Vss after completing 
the swing. Fortunately, a rail SETting a gate is at the same time RESETting another. We 
feel that as the number of gates connected to a rail increases in a balanced way, the adverse 
effects of this nonlinearity are minimized. Intuitively, the effective RC time constant is now 
equal to R × C_average. 

In addition to the above, there is also the effect of capacitive coupling. Take an output 
wire that carries a logical true. The devices that it controls in subsequent gates are on. 
Because of the gate-to-channel capacitance, when the subsequent gates SET, the varying 
voltage in the channels of turned on devices capacitively couples to the gate. This dumps, or 
extracts, charge from the output wire of the previous stage. Again, the effect is minimized 
due to symmetry. Since each output drives identical devices that are capacitively coupled 
to the top and bottom rail swinging in opposite directions, this capacitive coupling is 
almost entirely eliminated by the symmetry. We say almost because due to the capacitance 
nonlinearity, the symmetry will cancel the coupling when integrated over the entire swing 
and not instantaneously. 

We want to stress here that while minimizing the effects of the above phenomena
improves the power saving factor of CRL circuits, none of the above effects, even when 
extreme, jeopardizes the logical functionality of CRL circuits. HSpice simulations of the 2- 
input NAND gate, as well as simulations of other CRL gates and circuits, have demonstrated 
proper operation of the CRL circuit in the presence of these effects. This is important in 
simplifying the design of CRL logic. If a designer incorrectly sizes a branch in a CRL 
circuit, the worst he can expect is higher power dissipation in that part of the circuit and 
not a dysfunctional chip.

3.5 Spice Simulation 

Numerous HSpice simulations were conducted for N-Channel CRL gates and pipelines.
Since N-Channel CRL gates rely on bootstrapping action, it was important to examine their 
robustness. These HSpice simulations have confirmed the theoretical predictions regarding 
N-Channel CRL operation. 



3.6 Circuit Example 

Figure 3.14 illustrates the design of a 3-Bit full adder. In this design, pipelining was 
carried out to its fullest extent. The adder consists of 3 1-bit full adders in the forward 
direction and 3 1-bit full subtractors in the reverse direction. In this implementation, 
pipelining was stretched to its limit in the sense that the computation is allowed to retire 
only one bit of addition every cycle. That is, the carry out of an adder will affect the next
significant 1-bit adder only during the following cycle. Because the full 3-bit addition is
spread out among the pipeline stages, the needed information to start the reverse pipeline 
arrives later than when it is needed. Fundamentally, the carry for the addition progresses 
in-step with the forward pipeline, while the carry for the subtraction must progress in-step 
with the reverse direction. The only way to satisfy both of these constraints is to wait until 
all the 1-bit additions have been completed before starting the subtractions. This means 
that a large number of intermediate results have to wait around until needed by the delayed
reverse subtractions. This explains the large number of intermediate reversible registers in
the diagram.

Figure 3.14: Schematic diagram illustrating the connection topology of a 1-cycle throughput
3-bit CRL adder using fast carry-save implementation.

In practice, a 3-bit adder in CRL technologies would be built out of a single 3-bit carry- 
propagate adder that completes the addition in one cycle. This would eliminate the need 
for the majority of the extra registers in Figure 3.14. We will revisit the subject of CRL 
adders in more detail when we describe the details of the demonstration chip, SCRL-1, that 
was built to verify our CRL concepts. The implementation of the 3-bit adder described 
here is included to show how we could achieve single-cycle heavily pipelined performance if 
we needed to. 

Upon the discovery of Split-Level CRL, the research focus shifted away from the tech- 
niques described in this section and towards this new and much simpler CRL technique that 
promised obvious advantages. In the next section I shall describe this new and improved 
technique, which we call Split-Level CRL.

4. Split-Level Charge Recovery Logic



4.1 Introduction 

In the previous section we presented forms of early CMOS charge recovery logic (CRL), 
with a power dissipation that falls with the square of the operating frequency, as opposed 
to the linear drop of conventional CMOS circuits. Our original implementation, however, 
had some drawbacks. It required 8 to 16 times as many devices as conventional CMOS, used
2 to 4 wires for every signal, and relied on node capacitances to hold a logic level on half of
the wires. 
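
The difference between the two scaling laws can be illustrated numerically. The following
is a minimal sketch and is not taken from the measurements in this work. It assumes the
usual first-order estimate that charging a capacitance C through a resistance R with a ramp
of duration T, where T is much longer than RC, dissipates roughly (RC/T) x C x Vdd^2, and
it further assumes that the ramp occupies one full clock period. The function names and the
component values in the usage lines are hypothetical.

```python
def cmos_power(c, vdd, f):
    """Conventional CMOS: energy C*Vdd^2 per cycle, so power is linear in f."""
    return c * vdd ** 2 * f


def quasistatic_power(r, c, vdd, f):
    """Quasistatic charging (assumption: ramp time equals one clock period).

    Energy per cycle is approximately (R*C/T) * C * Vdd^2, which is linear
    in f, so the power (energy times f) is quadratic in f.
    """
    t_ramp = 1.0 / f                      # assumed ramp duration
    energy = (r * c / t_ramp) * c * vdd ** 2
    return energy * f


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    R, C, VDD = 1e3, 1e-12, 5.0
    for freq in (1e6, 5e5, 2.5e5):
        print(freq, cmos_power(C, VDD, freq), quasistatic_power(R, C, VDD, freq))
```

Under these assumptions, halving the frequency halves the conventional power and quarters
the quasistatic power.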

In this section we present a much improved form of CRL, Split-Level CRL, that uses 
twice as many devices as conventional CMOS, requires only one wire for every signal, and 
actively drives all outputs during sampling. Further, we will show how to construct Split- 
Level CRL circuits using only 2 external inductors for every chip. 

Conceptually, Split-Level CRL differs from earlier CRL in two ways. The first is the use 
of Split-Level voltages. The second is the elimination of the RESET devices and delegating 
the action of restoring the voltage on SET nodes to gates in the reverse pipeline. 

4.2 Split-Level CRL Gate 

We begin by describing the topology and operation of a Split-Level CRL inverter. Like 
conventional CMOS, SCRL gates can have many inputs and outputs. We select the in- 
verter to simplify the description. A device-level diagram of the SCRL inverter is shown in 
Figure 4.1. It is identical to a conventional inverter except for the addition of a pass gate 
at the output and the fact that the top and bottom rails are now driven by clocks rather 
than Vdd and GND. We call the clock controlling the top rail φ1 and that controlling the
bottom rail /φ1. We refer to the clocks that control the pass gate as P1 and /P1.

Figure 4.1: Split-Level CRL inverter. (Labels: Input, Output, internal node, /P1, /φ1.)

Figure 4.2: SCRL abstraction box.

Initially, the input, φ1, /φ1, the output, and all internal nodes are at Vdd/2. In addition,
P1 is at GND and /P1 is at Vdd, i.e., the pass gate is turned off. After accepting a valid
input, Vdd or GND, we turn the pass gate on by gradually swinging P1 and /P1 to Vdd and
GND respectively. We now gradually swing φ1 to Vdd and /φ1 to GND. The fact that
both φ1 and /φ1 start at Vdd/2 and split towards Vdd and GND respectively is the reason
we call this family Split-Level CRL. If the input to the gate was at Vdd then the output
would follow /φ1 to GND. If the input was at GND then the output would follow φ1 to
Vdd. We note that at the end of the φ1, /φ1 swings, the output is the logical NOT of the
input. The output is also actively driven and could now be sampled by another gate later
in the pipeline.
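
The behavior of the output during the split can be captured by a small idealized model.
This is only a sketch: it ignores device thresholds, resistance, and timing, and simply
states which rail the output follows. The function name, the "split" parameter (0 for rails
at Vdd/2, 1 for rails fully split), and the 5 V supply are assumptions introduced for
illustration.

```python
def scrl_inverter_output(input_high, split, vdd=5.0):
    """Idealized output voltage of an SCRL inverter with its pass gate on.

    input_high: True if the input is at Vdd, False if it is at GND.
    split:      0.0 when both rails sit at Vdd/2, 1.0 when fully split.
    """
    half = vdd / 2.0
    top_rail = half + split * half      # phi1 swings from Vdd/2 up to Vdd
    bottom_rail = half - split * half   # /phi1 swings from Vdd/2 down to GND
    # A high input turns on the pull-down path, so the output follows /phi1;
    # a low input turns on the pull-up path, so the output follows phi1.
    return bottom_rail if input_high else top_rail


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(s, scrl_inverter_output(True, s), scrl_inverter_output(False, s))
```

In this model the output never sees a sudden voltage step: it starts at Vdd/2 together with
the rails and moves only as far as the rails have split.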

After the output is sampled by a later gate, the pass gate of this inverter is turned off 
thus tri-stating the output. Following that, we return φ1 and /φ1 to Vdd/2. This in effect
restores all the nodes except the output to Vdd/2. We are now ready to accept a new input.
Please note that allowing the input to change prior to resetting all the nodes to Vdd/2 could
turn some devices on while there is a potential difference across them, leading to dissipation.

Remember that the output is still at a valid logic level, not Vdd/2, and before turning on
the pass gate again we must restore the level of this output to Vdd/2 to prevent dissipative
charge sharing. The promise is that at the point that the pass gate disconnected the output
from the inverter, the output was connected to a different gate that has the job of restoring
its level to Vdd/2. We will show how this is done in the following section.

4.3 Reversible Pipeline Connection and Timing 

The reason for not letting a SCRL gate restore its own output to Vdd/2 is to allow pipelining.
Note that to non-dissipatively restore the output to Vdd/2, the input to the gate must be
held constant during the splitting and restoration of its rails. The same restriction dictates 
that this gate does not restore itself before the subsequent gate in the pipeline restores itself 
and so on. This means that a new input to a pipeline must be held constant until the effect 
of this input propagates all the way to the end of the pipeline and until the restoration 
of the pipeline starting from the last stage reaches back to the first gate. This form of 
"pipelining" is obviously not very useful. 

In this section we show how to connect SCRL gates, or stages, in a non-dissipative 
pipeline. The main purpose of this method of interconnection is to provide a way of 
restoring the level of gate outputs to Vdd/2 with the right timing. We build the pipeline
out of copies of an abstraction box shown in Figure 4.2. 

We think of this box as containing a parallel set of SCRL gates performing any logical 
function of an arbitrary number of inputs. Symbolically, the output of the box represents a 
bundle containing the outputs of the SCRL gates internal to the box. The input to the box 
represents a bundle containing all the inputs of the gates internal to the box. The function
computed by the box is identified by the letter in the center of the box. Finally, indicated
at the bottom of the box are the clocks used to control both the Split-level rails and the
pass gate controls of all the SCRL gates internal to that box. A clock of φ1 in the lower
right corner indicates that the top rail is connected to φ1 and the bottom rail to /φ1, while
a clock of P1 in the lower left corner indicates that the pass gate is on when P1 is high.

Figure 4.3: Non-dissipative multi-stage pipeline connection.

Using this abstraction, Figure 4.3 illustrates how SCRL gates are connected to produce 
a non-dissipative pipeline. Note that the box with a function F^-1 performs the inverse
operation of the box with a function F. The computation proceeds from left to right in the
top half of the pipeline and the "uncomputation" proceeds from right to left on the bottom 
half of the pipeline. 

Each line linking SCRL gates is connected to the outputs of two different SCRL gates. 
For example, node (a) is connected to the output of F1 and to the output of F2^-1. There
are two reasons why no logic fights occur between the gates driving the same line. The first
is that when one gate is driving the line the other is tri-stated and vice versa. The second
is that during hand off, the voltages at the outputs of the gates are guaranteed to be equal. In
this pipeline, the forward gates are responsible for gradually swinging an output line from
Vdd/2 to Vdd or GND depending on the computation. The reverse pipeline is responsible
for restoring the output line from the active levels to Vdd/2.

To avoid dissipation, the backward gates have to determine the value of the output that
they are about to restore to Vdd/2 and set their output to that level before their pass gate
is switched on, i.e., before the line is handed off from the forward gate. To see how this
works, we go through the events that occur after a new input, say a0, is presented to the
pipeline. First P1 turns on the pass gate of F1 and turns off the pass gate of F2^-1. Next φ1
splits, setting node (a) to the value F1(a0). F2 goes through similar transitions and produces
F2(F1(a0)) at node (b). Similarly F2^-1 produces F2^-1(F2(F1(a0))) = F1(a0). Note that at
this point the outputs of F1 and F2^-1 are at the same voltage level, which means
that it is now safe to hand off node (a) to F2^-1 from F1 by swinging P1 low. After the hand
off, we can restore F1 by restoring φ1. This could occur even without having to wait until
F2 is restored because F2^-1 is still holding node (a) at its valid value. After F2 is restored,
F2^-1 gradually restores node (a) to Vdd/2 and hands it over to F1. The timing diagram
for a four phase clocking scheme is shown in Figure 4.4. For φ1 ... φ4 in the figure, a high
indicates when they are split and a low when they are restored. For P1 ... P4, a high is Vdd
and a low is GND. With this pipeline, we are able to accept a new input every φ1 without
needing to wait for the restoration of later stages.

Figure 4.4: Rail timing for 4 phase SCRL.
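
The hand-off condition described above can be checked at the logical level with a short
sketch. It models only the values on nodes (a) and (b), not the voltages or the clock
timing. The stage functions f1, f2 and f2_inv are hypothetical 2-bit examples chosen because
f2 is invertible; they are not the functions used in any circuit in this work.

```python
def f1(x):
    """Hypothetical forward stage F1: bitwise NOT of a 2-bit value."""
    return (~x) & 0b11


def f2(x):
    """Hypothetical forward stage F2: increment modulo 4 (a bijection)."""
    return (x + 1) % 4


def f2_inv(y):
    """Inverse of the hypothetical F2: decrement modulo 4."""
    return (y - 1) % 4


def handoff_is_safe(a0):
    """Return True if F2^-1 reproduces the value F1 placed on node (a).

    Equal values mean both gates drive node (a) to the same level, so the
    line can be handed from F1 to F2^-1 without a logic fight.
    """
    node_a = f1(a0)              # set when phi1 splits
    node_b = f2(node_a)          # set when phi2 splits
    reverse_out = f2_inv(node_b) # value the reverse gate prepares for node (a)
    return reverse_out == node_a


if __name__ == "__main__":
    print(all(handoff_is_safe(a0) for a0 in range(4)))
```

If f2_inv were replaced by a function that is not the true inverse of f2, the check would
fail for some inputs, which corresponds to the dissipative case where the reverse gate
drives the line to the wrong level.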

There remains one problem, however. At the end of the pipeline, the input to the final F^-1
gate is not restored and hence driving this line is dissipative. Furthermore, it could not be
generated, as this is the place where reversibility is broken. This implies the fundamental
limit that links information entropy with thermodynamic entropy. If at any moment a piece 
of information that is vital to reconstruct the past is lost, energy is dissipated. Fortunately, 
this dissipation occurring only at the end of a long pipeline is negligible. 

4.4 SCRL Clocking Variants 

In what follows, we will describe a number of alternatives for constructing SCRL circuits. 
These circuits differ primarily by the number of required phases and/or rails that are needed 
to control their operation. The pipeline described in the previous section required
four-phase clocking. This used four different clock phases in the forward pipeline and four
others for the reverse pipeline. By four-phase we mean that the shortest feedback path
in the pipeline has to span a minimum of four pipeline stages. In the following sections
we will show how to construct SCRL circuits using two-phase, three-phase, five-phase and
six-phase pipelines. One might simplistically think that fewer phases lead to fewer required
rails. This unfortunately is not true since for some implementations the required phases
are non-symmetric and therefore the complement of a phase cannot be used for more than
one purpose. Therefore, the primary reason for reducing the number of phases is to
minimize the number of stages for the shortest feedback path. Additionally, the lower the
number of phases that a SCRL circuit uses, the easier it is to understand and apply.

4.4.1 Two-Phase SCRL 

For all the implementations that will be described in this section, the basic gate is the
same as the one described in Section 4.2. Figure 4.5 shows a pipeline of a two-phase
SCRL implementation. The timing relationships among the rails are shown in Figure 4.6.
Two-phase SCRL forfeits the benefit of always actively driving the nodes whenever they
are sampled in exchange for achieving two-phase pipelining.

Figure 4.5: Two-Phase SCRL pipeline.

Figure 4.6: Rail timing for 2 phase SCRL.

For φ1 ... φ4 in the timing diagram, a high indicates the time when φ and /φ are split
and a low indicates when they are at Vdd/2. For P1 ... P4, a high indicates that P is at
Vdd and /P at GND while a low indicates that P is at GND and /P at Vdd. The bottom
two timing lines indicate the states of outputs driven by φ1 and φ2 gates. A high there
indicates when the output is at an active level of Vdd or GND, while a low indicates that
the output is at Vdd/2. The shaded regions in the timing diagrams indicate the times at
which the signals are not being actively driven, i.e., floating at an active level.

4.4.2 Three-Phase SCRL 

Figure 4.8 shows the timing diagram of a three-phase SCRL implementation. The
bottom timing line shows the timing of an output that is driven by a φ1 gate. Note that in
three-phase, and higher, implementations the outputs are always actively driven. Figure 4.7
shows a three-phase SCRL pipeline.

It is relatively easy to generalize the above concepts to five-phase, six-phase, etc. Since
three-phase systems achieve actively driven outputs, the usefulness of higher phase systems
could be limited.

Figure 4.7: Three-Phase SCRL Pipeline.

Figure 4.8: Rail timing for 3 phase SCRL.



4.5 Non-Inverting Stage 

Since the basic SCRL gate mimics that of conventional CMOS, we find that it is not 
possible to pass a signal through a SCRL stage without inverting it. For some circuits it 
is necessary to receive both the true and complement of a logical signal simultaneously at 
the inputs of a logic gate. Starting with a single signal, it is not possible to have its true 
and complement arrive at a later stage simultaneously given the circuits we have described 
so far. In order to pass a signal without inversion we substitute the basic SCRL gate with 
the one shown in Figure 4.9. Please note that this buffer requires an additional set of
controlling clocks that we call "fast φ1" and "fast /φ1" for a φ1 gate. The restriction on fast φ1
is that it splits immediately after φ1 splits and that it restores just before φ1 restores. In other
words, the transitions of φ1 contain within them the transitions of fast φ1. For stages where
we want to pass a signal without inversion, we use a gate similar to the one in Figure 4.9
and we clock its fast clocks according to the relations described.

Figure 4.9: Non-Inverting SCRL Gate.

In place of the inverters in Figure 4.1 and Figure 4.9 one can put any CMOS gate such
as NAND, NOR etc. We can see that an additional benefit of a non-inverting SCRL gate
is that it allows each functional block to have a 2-level logic implementation. This generally
aids in reducing the storage buffers that are sometimes needed for reversibility.

Another benefit of having a 2-level SCRL has to do with optimal step-up ratio of logic 
gates. It is well known that a CMOS inverter made out of the minimum size devices can 
optimally drive between 3-5 other inverters of the same size. A minimum sized inverter 
driving more than this optimal step-up number of loads similar to its size would have a 
larger delay. Since the power saving of SCRL is referenced to the maximum operating 
frequency of a similar circuit in conventional CMOS, this longer delay leads to less power 
savings. For an inverter to drive more than 3-5 loads and maintain the same speed, it must 
be made out of larger sized devices. Unfortunately, larger devices have larger input gate 
capacitances and hence present a larger load to the gates that are driving them. To see how 
this could be a problem, let us consider building a multiplier out of an array of identical 
1-Bit SCRL adder gates. The multiplier would consist of an array of gates in which each
gate takes its inputs from a previous identical gate and provides on its output the data for 
the inputs of another identical gate. Typically in these arrangements, each output would 
fan-out to drive more than 3-5 loads because each input to a gate feeds a number of devices 
internal to that gate. For SCRL, just as for CMOS, having an output drive more than 3-5 
loads its size is not optimal. As mentioned earlier, increasing the driving capability of a
gate so as to be able to drive the loads, e.g., by doubling the width of the devices used in
it, also increases the input capacitance, and hence the load, that this gate presents to the
identical gate driving it. By attempting to increase the driving capability, we also increase
the loads, and thus lose the benefit that we were attempting to gain.

Having 2-level SCRL allows for increasing the driving capability of a gate without 
increasing the load it presents to the other gates. This is done by performing most of the 
computations in the first level and then using the second level to provide the buffering. For 
this reason the first stage can consist primarily of minimum sized devices, and thus present 
the minimum load for the previous gate, while the second stage is made of devices 3-5 times 
the minimum size to give optimal driving capability. 

Finally, the timing diagram of a two-phase SCRL with fast clocks is shown in Figure 4.10. 
The figure indicates the position of the transitions of the fast rails using the dashed lines. 







Figure 4.10: Timing diagram of Two-phase SCRL with fast rails for non-inverting stages. 

4.6 External Inductors 

In Section 2 we examined how quasistatic switching could be achieved with the aid of
external inductors. For a 4-Phase SCRL pipeline, we need a minimum of 20 separate rails
to control the operation of the pipeline. Figure 4.11 is a diagram of an inductive rail driver.
A rail could be approximated by a capacitor in series with a resistor. The capacitor is the
sum of the capacitances that the rail is driving and the resistor is the equivalent resistance
of the devices through which the capacitances are driven. Suppose that the initial voltage
on the rail was Vinit and we want to swing the rail to Vfin. To start the swing, we connect
the rail through an inductor to a DC power supply at (Vinit + Vfin)/2. Current starts
to build up in the inductor and the rail starts the swing towards Vfin. At the moment
that the current drops back to zero again we disconnect the inductor. The rail should
now be at Vfin. The action of connecting and disconnecting the rail is performed by the
power MOSFET. Please note that the inductor is only necessary during the transition and
is otherwise disconnected from the rail. Note further that the current in a disconnected
inductor is zero. With this in mind, we should be able to multiplex the inductor among
multiple rail circuits so long as these multiplexed rails do not have simultaneous transitions.
Examining the timing diagram of Figure 4.4, we see that no more than two transitions occur
simultaneously at any moment. By using power MOSFET multiplexors on both sides of
the inductor, rather than a MOSFET on one side, we see that the maximum number of
required external inductors is 2. Integrating everything but the inductors on a silicon chip
means that a Split-Level CRL chip requires 7 additional pins for proper operation. Two of
these pins are Vdd and GND.
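
The swing described above can be illustrated with the closed-form response of a series RLC
circuit. This is a sketch under stated assumptions, not a model of the actual driver: the
rail is treated as an ideal linear capacitor in series with a fixed resistor, the switch is
ideal, and the circuit is assumed underdamped. The function name and the component values
in the usage lines are hypothetical.

```python
import math


def rail_voltage_after_swing(v_init, v_fin, r, l, c):
    """Rail voltage at the instant the inductor current returns to zero.

    The rail (capacitance c, series resistance r) is connected through an
    inductor l to a DC supply at (v_init + v_fin) / 2 and disconnected at
    the first zero crossing of the current, t = pi / omega_d.
    """
    v_mid = (v_init + v_fin) / 2.0
    alpha = r / (2.0 * l)                 # damping rate
    omega0_sq = 1.0 / (l * c)             # squared undamped frequency
    if alpha * alpha >= omega0_sq:
        raise ValueError("circuit is not underdamped; no resonant swing")
    omega_d = math.sqrt(omega0_sq - alpha * alpha)
    t_zero = math.pi / omega_d            # current is back at zero here
    # v(t) - v_mid = (v_init - v_mid) * exp(-alpha*t) * (cos + (alpha/omega_d) sin);
    # at t_zero the cosine is -1 and the sine is 0.
    return v_mid - (v_init - v_mid) * math.exp(-alpha * t_zero)


if __name__ == "__main__":
    # Hypothetical values: 1 uH inductor, 1 nF rail capacitance.
    print(rail_voltage_after_swing(0.0, 5.0, 0.0, 1e-6, 1e-9))   # lossless case
    print(rail_voltage_after_swing(0.0, 5.0, 10.0, 1e-6, 1e-9))  # with series resistance
```

With zero resistance the rail lands exactly on Vfin. With a nonzero resistance it stops
slightly short, and the remaining difference must be made up when the rail is clamped to
its final level.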

Figure 4.11: Inductive rail driver circuit.

4.7 Spice Simulation

A number of HSpice simulations were carried out on simple SCRL gates. HSpice simulations
were also carried out on submodules extracted from the actual layout of our 8x8
demonstration chip, SCRL-1, to check for SCRL circuit operations when parasitics are
included. All of the simulation results were verified by test measurements on the actual
demonstration chip as will be detailed in Section 6.

4.8 Lowering Irreversibility Cost 

In CRL and SCRL the reverse pipeline is required to accurately provide a delayed copy 
of the inputs that were used in the forward pipeline. Without the reverse pipeline we do 
not have enough information to always correctly compute the delayed copy of the inputs 
that are required to non-dissipatively reset the stages in the forward pipeline. The penalty 
of erroneously computing a delayed copy of these inputs is to dissipate energy similar to 
conventional CMOS for every erroneous bit. Unfortunately, there are situations in which 
providing the inverse of a function in the forward pipeline is cumbersome. Luckily all is 
not lost since in most of these cases we could apply a number of techniques that would 
make the dissipation associated with this irreversibility minimal. We must stress here that 
even though we might allow the breaking of reversibility at certain selective points in the 
system with all of the undesirable effects that we have mentioned above, we still insist
on employing reversibility throughout the majority of the system. This is in contrast to 
proposals that do not employ reversibility anywhere in their systems and hence are faced 
with the undesirable effects at the majority of the nodes in their systems. 

4.8.1 Irreversibility Is Not Free 

Before we describe how to reduce the dissipative effects of irreversibility we have to warn 
that irreversibility is not free. The best we can do is reduce the energy penalty associated 
with irreversibility. We can never eliminate it. A system containing irreversible elements 
is not a system in which the energy of the system can be asymptotically reduced without a 
lower bound. The packets of energy that are dissipated at the points where reversibility is 
broken in the system set a non quasistatic limit on the minimum energy dissipation of the 
system regardless of operating frequency. However, in environments where we are stuck, 
the following is included to aid in reducing the cost of increasing the information entropy 
of the system. 



Figure 4.12: Scaling down stage sizes before breaking reversibility in SCRL. 

4.8.2 Statistically Controlled Irreversibility 

The first technique relies on the observation that we are mainly concerned with reducing
the average, not the instantaneous, power consumption. Gates that compute the inverse
function of the gates in the forward direction produce a correct copy of the inputs all
the time. Without these inverse gates, we cannot guarantee to be correct all the time.
In certain applications however, we can guarantee to be correct most of the time. Since
dissipative events only occur whenever we guess wrong, being correct most of the time
results in substantial energy savings when compared to conventional CMOS without the
need for reversibility. To illustrate this we consider an example of an 8-Input NAND gate.
This gate outputs a FALSE if and only if all the inputs were TRUE. Otherwise, this gate
outputs a TRUE. Assuming that the input bits are random, the probability of the output
of this gate being at TRUE is 255/256 = 0.996. If we always assume the output to be at
TRUE, then we will have a dissipative event, caused by a wrong prediction, only 0.4% of
the time. In the pipeline in Figure 4.3 let F1 be this 8-input NAND gate. Then we
can omit F1^-1, assume that this omitted inverse gate outputs a FALSE all the time, and be
right 99.6% of the time. This could be important in situations in which the computation
of F1^-1 is not feasible or otherwise cumbersome.
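
The probability quoted above can be reproduced by enumerating every input combination.
The following is a minimal sketch that assumes, as the text does, that the input bits are
independent and equally likely to be TRUE or FALSE. The function names are hypothetical.

```python
from itertools import product


def nand(bits):
    """N-input NAND: FALSE only when every input is TRUE."""
    return not all(bits)


def misprediction_rate(n_inputs=8):
    """Fraction of input patterns for which 'output is TRUE' is a wrong guess."""
    patterns = list(product([False, True], repeat=n_inputs))
    wrong = sum(1 for bits in patterns if not nand(bits))
    return wrong / len(patterns)


if __name__ == "__main__":
    rate = misprediction_rate(8)
    print(rate, 1.0 - rate)
```

Only the all-TRUE pattern defeats the prediction, so the misprediction rate is 1/256 and
the prediction is correct for the remaining 255/256 of the patterns.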

4.8.3 Where to Break Reversibility 

The second technique concerns the way multi-stage buffering is done in CMOS and in 
SCRL. To drive a large load in CMOS, one must go through a number of progressively larger 
devices with each device driving another that is slightly larger than itself until the last one 
in the chain is large enough to drive the load. In SCRL, each larger stage is paralleled 
by another stage of comparable size in the reverse direction. If reversibility were broken 
immediately after the largest stage then dissipation would be large because the
input capacitance of the large reverse gate is significant. To alleviate the problem, we must
proceed with the pipeline beyond the last stage with inverters in the forward and reverse 
direction scaling down the size at each successive stage until we reach the minimum size 
possible. If reversibility is broken immediately after this minimum size stage, dissipation is 
minimized due to the much smaller input capacitances of the reverse stage. This is shown 
in Figure 4.12.

5. Demonstration Chip Details



5.1 Introduction 

To verify the quasistatic operation and behavior of Split-Level CRL, we have
fabricated and tested an 8x8 CMOS multiplier chip employing the circuit techniques of
Split-Level Charge Recovery Logic. Split-Level CRL was the obvious choice for implement- 
ing the demonstration chip because of its simplicity and closeness to conventional CMOS 
circuits. In this section, I will describe the internals of this demonstration chip. In the 
following section I will describe the measurement techniques that I used to verify the lower 
energy consumption of this chip as well as report the results of those measurements. 

5.2 High Level Multiplier Design 

Before we examine the available alternatives to building a multiplier, I will first review 
the needed operations to multiply two binary numbers. Let us assume that we want to 
multiply A = [a3, ..., a0] by B = [b3, ..., b0] to produce C = [c7, ..., c0]. The product C
would then be

C = (b0 x A) + (b1 x A) + (b2 x A) + (b3 x A)    (5.1)

Since in each of the product terms, bi is a binary bit with a value of either 0 or 1, the
multiplication could be carried out by bitwise ANDing bi with each bit of A. In addition,
we note that all the partial products are 4 bits wide and therefore we only need three 4-bit
adders to perform the multiplication. We must of course correctly position the consecutive
adders so that each addition is left justified to reflect the significance of the bi bit. So in
essence the multiplication of binary numbers involves nothing more than repeated justified
additions of the partial products. There are a number of different ways one can build an
N x N multiplier that will reflect the above procedure. The variations come from the way
one performs the additions since the calculation of the partial products has already been
reduced to trivial bitwise ANDing. As is widely known, the critical path in any adder is
the time it takes to propagate the carry all the way across the width of the sum. This is 
because when adding two numbers, the most significant bit could be affected by the sum in 
the least significant bit. Simplistically, one would perform each addition allowing enough 
time for each addition to completely finish, that is allowing enough time for the total carry 
propagation, before proceeding to add the next partial product. Fortunately, the carry 
effect is one-directional and bits of higher significance never affect the less significant part
of a sum. With this observation, we see that we need not wait for each addition to fully 
complete before commencing the next one. More specifically, we can start the following 
addition as soon as the previous one produces the least significant bits that the next adder 
needs. As long as the carries in the previous addition are at least one bit ahead of the 
carries in the current addition, both can proceed concurrently. Indeed this is how most fast
multipliers are built today. The way in which this overlap is accomplished is the origin of
the multitude of options for fast N x N multipliers.

Figure 5.1: Block diagram of a 4-Bit SCRL multiplier.
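
The procedure of ANDing each bit of B with A and adding the justified partial products can
be sketched in a few lines. This is an illustration of the arithmetic only, with no carry
overlap and no relation to the circuit structure of the chip. The function names and the
default 4-bit width are assumptions made for the example.

```python
def partial_product(a, b_bit, width=4):
    """Bitwise AND of every bit of A with a single bit of B."""
    mask = (1 << width) - 1 if b_bit else 0
    return a & mask


def multiply(a, b, width=4):
    """Sum the partial products, each shifted to the significance of its b bit."""
    total = 0
    for i in range(width):
        b_i = (b >> i) & 1
        total += partial_product(a, b_i, width) << i
    return total


if __name__ == "__main__":
    print(multiply(13, 11))
```

Each loop iteration corresponds to one justified addition; because the first partial sum is
zero, an N-bit multiplication needs only N - 1 true additions.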

Unlike conventional CMOS, an SCRL multiplier needs to compute the partial products 
in both the forward and the reverse directions. For an N x N multiplier constructed from
N - 1 adders, the inverse functions needed in the reverse pipeline are nothing but N-bit
subtractions. Figure 5.1 illustrates the organization of an SCRL 4-Bit multiplier. Briefly,
the "+" blocks take in a partial sum P, the operand A and the corresponding bit from B.
The block bitwise ANDs each bit of A with the supplied bi and then sums the result with
the partial sum P. In addition to producing the sum, each adder block passes on a copy of
A for use by subsequent adders. In practice, the path of operand B contains synchronizing
reversible registers. They are omitted from the diagram to reduce clutter. The subtraction
blocks are identical to the addition block except that they subtract bi x A from the value
presented to their P inputs. Since the P inputs to the first adder are zero, the first adder
and the corresponding subtractor could be eliminated.
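
The relationship between the adder blocks and the subtraction blocks can be checked with a
short behavioral sketch. It models only the values of the partial sums, not the pipeline
timing or the reversible registers on the B path. The function names are hypothetical, and
the shift by i stands in for the justified positioning of each adder.

```python
def forward_partial_sums(a, b, width=4):
    """Partial sums P after each "+" block, starting from P = 0."""
    p = 0
    sums = [p]
    for i in range(width):
        b_i = (b >> i) & 1
        p = p + ((b_i * a) << i)    # add b_i x A at the significance of bit i
        sums.append(p)
    return sums


def reverse_partial_sums(product_value, a, b, width=4):
    """Undo the additions in reverse order by subtracting b_i x A."""
    p = product_value
    sums = [p]
    for i in reversed(range(width)):
        b_i = (b >> i) & 1
        p = p - ((b_i * a) << i)
        sums.append(p)
    return sums


if __name__ == "__main__":
    fwd = forward_partial_sums(13, 11)
    rev = reverse_partial_sums(fwd[-1], 13, 11)
    print(fwd)
    print(rev)
```

The reverse list is the forward list read backwards, which is the property the reverse
pipeline needs: each subtractor reproduces exactly the value that the matching adder
received on its P input.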

Because of the need for the inverse computation, our design cannot easily take advantage 
of the ability to overlap the carry propagation in the consecutive adders. The reason 
for disallowing the overlap is that it complicates the book keeping necessary to perform 
the needed subtractions and requires a large number of additional intermediate registers. 
For this reason, each horizontal adder in our implementation is allowed enough time to 
completely finish its operation before the following adder can proceed. 

In a conventional setting, not being able to overlap the carry propagations leads to 
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slower performance with no additional advantage. Fortunately, this is not the case within 
the context of SCRL technology. Recall that the only reason we employ SCRL 
circuit techniques is to reduce energy dissipation. As we explained in Section 2, energy 
savings occur only as a result of operating the chip below the maximum possible operating 
frequency. For this reason, the clock frequency of circuits employing SCRL techniques is 
always lower than the maximum possible, more than 10 times lower in most cases. 

In conventional CMOS, the designer is mostly concerned with minimizing the worst case 
propagation delay of the critical path of the circuit in order to meet the specified operating 
frequency target. In SCRL however, the clock period is much larger than the worst case 
propagation delay of the circuit in order to yield the promised energy savings. For this 
reason, having to wait for the completion of previous additions before proceeding with 
subsequent ones will not result in erroneous operation as there is a large margin between 
the clock period and the delay. Slower circuits however do mean more dissipation since the 
energy dissipation is inversely proportional to the ratio of the clock 
period to the average propagation delay. Note that in SCRL, it is the average and not the 
worst case delay that is of interest. This is because, unlike conventional CMOS, there is a 
purposely introduced margin between the worst case delay and the clock period to 
save energy. The concern in SCRL about slow circuits is that they reduce the energy 
saving advantage of SCRL by eating away at this introduced margin. Since our concern is 
with the average energy consumption, our interest is therefore in the average circuit delay. 

From the above we see that SCRL circuits should be optimized to minimize the average, 
and not the worst case, propagation delays. Having said that, we note that when the average 
propagation delay is taken as the measure of comparison, the difference in performance 
between allowing or disallowing the overlap of the propagation of carries in the adders of 
our multiplier becomes insignificant. This is because on the average, given random data, 
a carry propagates only 1.6 bit positions to the left. The pathological case of the carry 
propagating from the least significant position to the most significant position occurs very 
infrequently and therefore a non-overlapped implementation would on the average be only 
1.6 times slower than the heavily pipelined one. 
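
The gap between average and worst-case carry propagation can be checked exhaustively for 8-bit operands with a short Python sketch. The definition of chain length used here (the number of bit positions one generated carry travels, including the carry out) is an assumption of this sketch, so it is not expected to reproduce the 1.6 figure quoted above exactly; it only illustrates that the average is far below the worst case.

```python
def longest_carry_chain(a, b, n=8):
    """Longest distance, in bit positions, travelled by any single carry in a + b."""
    longest = run = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                # generate: a new carry starts at this bit
            run = 1
        elif (ai ^ bi) and run:    # propagate: an incoming carry moves one bit further
            run += 1
        else:                      # no carry arrives here, or the carry is absorbed
            run = 0
        longest = max(longest, run)
    return longest


def carry_statistics(n=8):
    """Average and worst-case longest chain over all pairs of n-bit operands."""
    lengths = [longest_carry_chain(a, b, n)
               for a in range(1 << n) for b in range(1 << n)]
    return sum(lengths) / len(lengths), max(lengths)
```

Calling `carry_statistics()` returns an average of only a few bit positions, while the worst case (for example 255 + 1) spans all 8 bits.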

Because of the above reasons, we choose the non-overlapped implementation since it 
greatly simplifies the design of the inverse pipeline while not compromising power dissipa- 
tion. 

5.3 Multiplier Details 

In this section I describe the internal details of the 8x8 multiplier 
demonstration chip we call SCRL-1. The description follows the hierarchy of the design, 
starting from the high level blocks and proceeding to the details of their components 
and sub components. 

At the highest level of the hierarchy, the multiplier consists of 12 reversible pipeline 
levels. All reversible levels share the property that each contains one and only one reversible 
pipeline stage. A reversible pipeline stage is defined as a block containing exactly one 
pipeline stage which computes a logical function and proceeds in the forward direction, 
accompanied by a matching pipeline stage which computes the inverse of that function 
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and which proceeds in the reverse direction. SCRL-1 was implemented using SCRL circuit 
techniques that were described in detail in Section 4. More specifically, each pipeline stage 
in the SCRL-1 chip is a 2-level SCRL implementation of the kind described in Section 4.5. 

5.3.1 Reversible Level 1 

This is the first level in the multiplier pipeline. The two operands A and B are fed 
directly from the chip pins to this level through ESD-protected input pads. The structure 
of these input pads is slightly different from those used in CMOS and will be described in 
a later section. 

Throughout SCRL-1, operand A is repeatedly presented to the inputs of the 8-bit adders 
to be conditionally added to the partial sum computed so far, depending on the value of 
the bit b_i from B. In general, adders contain a number of XOR gates operating on every 
input bit to produce the partial sum or to compute the propagate-generate signals that are 
needed for processing the internal carries. Noting that an XOR is a gate that computes 
$(a\bar{b} \vee \bar{a}b)$, we see that providing the inputs to the adders in both their true and complement 
forms greatly simplifies the design of the adders. In SCRL-1 all the inputs to the adders 
are provided in their true and their complement form. In addition, the sums produced 
from each adder are also in dual form since a partial sum from one adder is used as an 
operand by the next adder. However, SCRL-1 has input pins only for the true copy of the 
operands. For this reason, the function of reversible level 1 in SCRL-1 is to generate the 
needed complements of the signals on the input pins and to feed the true and complement 
copies simultaneously to the adder in reversible level 2 of SCRL-1. 
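
The benefit of dual-form inputs can be shown with a small Python model. It is a sketch under the assumption that each signal is a (true, complement) pair; the helper names are hypothetical. With both polarities available, the XOR and its complement need only AND and OR terms, with no inverters inside the gate.

```python
def dual(x):
    """Return a 1-bit value in dual form: (true copy, complement copy)."""
    return (x, 1 - x)


def dual_xor(a, b):
    """XOR of two dual-form signals, producing a dual-form result."""
    a_t, a_c = a
    b_t, b_c = b
    true_out = (a_t & b_c) | (a_c & b_t)   # a AND NOT b, OR, NOT a AND b
    comp_out = (a_t & b_t) | (a_c & b_c)   # XNOR, built from the same four rails
    return (true_out, comp_out)
```

For example, `dual_xor(dual(1), dual(0))` yields the dual form of 1, so the result can feed the next adder without any extra inversion stage.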

In any multiplier, the first adder retires the two product terms 

PartialSum = (b_0 × A) + (b_1 × A) 

since at this point the partial sum is zero. We therefore see that the first adder requires 
bits b_0 and b_1 from operand B to perform its function. For this reason, reversible level 1 
provides the dual true and complement copies of only these bits and passes the remaining 
bits of B in only their complement form. The reason for complementing the unused B bits 
is that an SCRL inverter needs half as many devices as an SCRL buffer. 

Reversible level 1 does not have any logic for the inverse pipeline stage in the reverse 
direction. This is because the output from a reverse pipeline stage is only needed to 
eliminate energy dissipation in the previous pipeline stage. Since reversible level 1 is the 
first level in the SCRL-1 chip, and since we are not concerned with the energy dissipation in 
the pin drivers to SCRL-1, the reverse component of the pipeline in this level was omitted. 

Figure 5.2 shows the block diagram of reversible level 1. It takes in the inputs bIn<7:0> 
and aIn<7:0> corresponding to operands B and A, and outputs bOut<0:3>, nbOut<7:2> 
and aOut<15:0>. bOut<0,2> are the true and complement copies respectively of bIn<0> 
produced by the module splitReg1_16 (I10). Another copy of splitReg1_16 (I9) produces 
the true and complement of bIn<1>. bIn<7:2> are inverted by copies of the inv1_1 
module to yield nbOut<7:2>. Finally, the module splitReg8_2 takes in aIn<7:0> and 
produces aOut<7:0> which are the true copies of the inputs and aOut<15:8> which 
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Figure 5.2: Schematic diagram of reversible level 1. 
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Figure 5.3: Schematic diagram of splitRegl_16 module. 

are the complement. Internally, splitReg8_2 contains 8 copies of a module similar to 
splitReg1_16 except for device sizing. 

Figure 5.3 shows the design of the splitReg1_16 module. We examine it in detail to 
illustrate the structure of a typical SCRL gate. From the left, the first two P-Channel 
devices and the first two N-Channel devices are part of the first stage of this 2-level SCRL 
gate and are controlled by the slow rails. They are followed by a CMOS inverter that is 
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part of the second logic stage which is needed to generate the non-inverted signal and are 
controlled by the fast rails. The remaining devices make up the pass gates that are needed 
at the output of every SCRL gate and are controlled by the pass rails. 

Throughout the diagrams of SCRL-1, xxxxRail<0> indicates the positive going rail 
while xxxxRail<1> indicates the negative swinging one. The same convention is followed for the pass 
rails. The prefix f in the rail name indicates that this rail controls a component that is part 
of the forward pipeline, while the prefix r indicates that the rail controls a component that 
is part of the reverse pipeline. Since reversible level 1 does not have a reverse component, 
there are no r prefixed rails in the diagram. Examining the diagram one can see that the 
two inverters in the first stage perform the same function and hence might appear redundant. 
This is not the case for the splitReg1_16 module. The replicated devices are included to 
maintain the symmetry of the drive capabilities for the two outputs of the module. 

5.3.2 Reversible Level 2 

Reversible level 2 is the first level that contains inverse blocks for every function in its 
forward pipeline. Reversible level 2 performs the following functions: 

• It produces the true and complement of nbIn<2> that will be needed in the next 
level. 

• It passes the other bits of B for later levels. 

• It performs the addition of the first two partial products. 

Figure 5.4 shows the schematic for reversible level 2. Since this level does contain 
the inverse pipeline stages, each module in the schematic is a module containing both 
the forward as well as the reverse components of the pipeline. For example, the module 
revInv1_1 not only inverts the unused B bits and passes them on to the next reversible 
level, it also reads the levels at its output and later computes the "inputs" as sequenced 
by both the forward and the reverse control rails. As an example, I include Figure 5.5 
which contains the schematic for the revInv1_1 module. Since an inverter is the inverse 
function of itself, this module contains an inverter module inv1_1 in the forward direction 
and another inv1_1 in the reverse direction. The forward one is controlled by the forward 
control rails and the reverse one is controlled by the reverse control rails. At this point I 
wish to emphasize that even though the names aIn and bOut reflect the flow of data in 
the forward direction, each line is now both an input and an output depending on where 
we are in the cycle. 

Module revfirstadd8 in the schematic performs the addition of the first two partial 
products and produces the partial sum in both true and complement form on rOut<3:0> 
and sOut<15:0>. rOut<3:0> correspond to the true and complement of the two least 
significant bits of the result and sOut<15:0> correspond to the 8 most significant bits of 
the result. We get 10 bits from the first addition because one of the partial products is 
shifted left by one position to reflect the significance of b_1. The true copies from rOut<0:3> 
correspond to the first two bits of the total product and are not used in the next additions. 
sOut<15:0> are all used in the subsequent addition. Since we are passing both A and B 
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Figure 5.4: Schematic diagram of reversible level 2. 
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Figure 5.5: Schematic diagram of revlnvl_l module. 

along so far, we do not need to compute the inverse of this addition to produce the delayed 
copy of the inputs. This is because we already have these inputs in the registers that are 
passing A and B along. 

The 8-bit adders used in SCRL-1 consist of a cascade of eight 1-bit full adders with 
multiplying AND gates at the inputs to make an 8-bit partial product adder. In general a 
1-bit multiplying full adder could look something like Figure 5.6-a. I0 performs the 
bitwise ANDing to produce the partial product. I1 generates the sum without including the 
carry-in, which also serves as the carry Propagate signal. I2 produces the carry Generate 
signal. The full sum comes out of I3 while the carry for the next stage is produced by 
I5. The trouble with this circuit is that an 8-bit adder built using it contains a carry path 
that is roughly 16 logic levels deep. Earlier we stated our desire that each SCRL-1 
adder completely propagate all the carries before the next one is to proceed. The problem 
is that each stage of the SCRL-1 pipeline is at most 2 logic levels deep. We solve this problem 
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Figure 5.6: (a) Schematic Diagram of a generic 1-bit propagate-generate full adder, (b) 
Diagram of the 2-level 1-bit adder that is used in the SCRL-1 chip. 

by redesigning the adder to look like the one in Figure 5.6-b. 

In our implementation of SCRL-1, I0 is lumped with I2 into one logic level by expanding 
the terms of the XOR. This produces the carry Propagate signals in one logic level. The 
action of the logic producing this Propagate signal is of course controlled by the slow rails 
of the stage. With the Propagate signal generated and stable, the multiplexors I3 and I4 
are ready to receive signals on their inputs without any dissipative events. At this stage 
the action of the fast rails activates the logic of I2. In addition, the carry-in for the least 
significant 1-bit adder in the 8-bit chain is also driven active by the action of the fast 
rails. With this in mind, we see that SETting the fast rails succeeds in producing all the 
intermediate carries as well as the correct 9-bit result of the addition. The negation at the 
input of the multiplexor I3 contains no logic since all the signals are available in their dual 
form. 

With random data, this adder is just as fast as carry-save adders and since the network 
that the fast rail activates is set up in advance, this adder is not much slower even in the 
cases where the carry has to propagate for a few places. 
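
The behavior of the 8-bit multiplying adder can be sketched at the logic level in Python. This models only the Boolean function (AND for the partial product, Propagate and Generate per bit, and a multiplexer selecting each carry); it does not model the slow and fast rail timing, and the function name is hypothetical.

```python
def multiplying_add8(p, a, b_bit, n=8):
    """Return the (n+1)-bit result of P + b_i x A using propagate/generate logic."""
    carry = 0
    result = 0
    for i in range(n):
        pp = ((a >> i) & 1) & b_bit      # partial-product bit from the AND gate
        s = (p >> i) & 1                 # corresponding bit of the partial sum P
        prop = pp ^ s                    # Propagate, also the sum without carry-in
        gen = pp & s                     # Generate
        result |= (prop ^ carry) << i    # full sum bit
        carry = carry if prop else gen   # multiplexer: pass the carry or use Generate
    return result | (carry << n)         # carry out becomes the ninth result bit
```

Once every Propagate bit is stable, the multiplexer chain forms a path that is fixed in advance, which is the property the text relies on when the fast rails ripple the carries through in a single step.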

In reversible level 2, module revfirstadd8 contains 9 1-bit adders. This is because 
this adder retires the first two partial products, producing a sum that spans 10 bits. For 
this reason also, the adder in this level differs from other adders in SCRL-1 in that it has 
enough AND gates to produce two partial products instead of one. 
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Figure 5.7: Schematic diagram of reversible level 3 module. 

5.3.3 Reversible Level 3 

Figure 5.7 shows the block diagram of reversible level 3 of the SCRL-1 chip. All the sub 
blocks in the diagram are familiar except the instances I28, I27, I5 and I6 containing 
the new module revMerge1_2. The function of this new module is quite simple. At this point 
all the outputs from the previous level are present in their dual forms. However some of 
these bits are not going to be used again. These include the bits from B that were 
used in the previous product as well as the least significant bits of the total product. For 
these signals, it is a waste of circuit area, power and wire area to keep 
them in dual form. For this reason the revMerge1_2 modules were added in the path of 
these signals to take in the dual form and output the true copy, and to do it in a reversible 
manner. 

As a side note, module names differing only in the numbers appearing at the end of 
their names are identical in function. For example modules revInv1_1 and revInv1_4 are 
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identical in function. The difference in their numbers reflect a difference in their device 
sizing and in their driving capability. 

In this level, module revfulladd8 is an 8-bit reversible adder. It contains an adder in 
the forward direction that conditionally adds A to the partial sum S based on the value of 
b_2. In addition it contains a subtractor that takes the output produced by the adder in the 
forward direction and conditionally subtracts A from it to produce a delayed copy of the 
inputs to this reversible level. Please note that revfulladd8, like revfirstadd8, contains 
reversible registers to pass through a copy of A to the next reversible level. 

5.3.4 Reversible Levels 4-8 

Reversible levels 4 through 8 are all very similar to reversible level 3. Some modules 
in these reversible levels have names similar to the familiar revSplitx_x or revMergex_x 
modules except that they are prefixed with the letter n. For split type modules, this 
indicates that the input data is in complement form. For the merge type modules, 
this indicates that the output of the merged true and complement is the complement copy. 
This is necessary so that the merged output agrees in polarity with the other bits 
that were merged previously and have since been going through an inversion every pipeline 
stage. 

5.3.5 Reversible Level 9 

Figure 5.8 shows the block diagram of reversible level 9. In SCRL-1, the addition of the 
last partial product is performed in reversible level 8. Since this addition is the last to be 
performed, the dual form of A is no longer necessary. In addition, since the sum out of this 
addition is not going into a subsequent adder, its dual form is also not needed. For this 
reason, reversible level 9 contains two 8-bit merging blocks, revMerge8_2. One is used to 
merge the bits of A while the other is used to merge the sum bits to yield the 8 high bits of 
the total product. As shown in the diagram, reversible level 9 has some additional merge 
and inv blocks to merge and pass both the B bits as well as the least significant bits of the 
total product. 

In the diagram, rOutLow<7:0> are the least significant bits of the total product while 
rOutHigh<7:0> are the most significant bits. Also bOut<7:0> and aOut<7:0> are true 
copies of the operands B and A respectively. 

5.3.6 Reversible Level 10 

Figure 5.9 shows the block diagram of reversible level 10. This level performs two 
functions. The first function is to take the 16 bit product produced by the previous level and 
to increase its drive 16 times through the revBuffer1_16 modules in preparation for driving 
the very large devices at the input of the output pad drivers. Internally, each revBuffer1_16 
module consists of a non-inverting SCRL buffer made of two levels of inverters. The drive 
is increased by 4 at each of the two inverter stages to achieve a drive of 16. The output 
bits rOut<15:0> are routed directly to the output pin drivers. They are also routed to 
reversible level 11 for a reason to be discussed later. 
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Figure 5.8: Schematic diagram of reversible level 9 module. 
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Figure 5.9: Schematic diagram of reversible level 10 module. 

The second function of reversible level 10 is to non-dissipatively terminate the pipelines 
carrying the bits of A and B. Ideally, these bits should be routed to output pins of SCRL-1, 
and would act as inputs for the reverse pipeline during certain operating phases of the 
chip. Without that, reversibility would be broken and bits of A and B arriving at this 
level would "fall on the floor", thus generating information entropy and dissipating CV^2 
per bit dropped. Unfortunately, the pin budget of SCRL-1 precluded routing A and B to 
the chip pins. In practice, one would not worry much about this dissipation as it would be 
small compared to the total dissipation of the chip. However, since SCRL-1 was designed 
to measure the energy saving of fully reversible circuits, reversibility had to be "faked". 
While testing SCRL-1, the input pattern would be held constant, yielding the same result 
each cycle. Therefore, if we latch the values of A and B in some static register, we can later 
use them to drive the reverse pipeline at the place where it was terminated, thus simulating 
reversibility and avoiding dissipation. It is this function that the pipelineStop modules 
perform in reversible level 10. The enable signal is an input pin of SCRL-1 and is controlled 
to lock in the value to be used to drive the terminated pipelines when needed. Under normal 
operation, these latches could be filled with values that are more likely to occur based on 
the non-random nature of the inputs. In the event that a latch value equals the incoming 
bit, SCRL behavior is preserved and dissipation is eliminated. Therefore, insofar as the 
latched pattern accurately predicts the input values, dissipation is reduced. 
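
The cost of a terminated pipeline can be estimated with a simple Python sketch. It assumes, following the text, that each bit whose latched prediction disagrees with the arriving value dissipates CV^2 and that matching bits dissipate nothing. The function name and the capacitance and voltage parameters are hypothetical placeholders, not measured SCRL-1 values.

```python
def stop_dissipation(latched, incoming, width, c_node, v_swing):
    """Energy lost at a pipeline stop: C*V^2 for every mispredicted bit."""
    mask = (1 << width) - 1
    mismatches = bin((latched ^ incoming) & mask).count("1")
    return mismatches * c_node * v_swing ** 2
```

With a constant test pattern the latched word always equals the incoming word, so the estimate is zero, which is the situation the SCRL-1 measurements rely on.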

5.3.7 Reversible Level 11 

Figure 5.10 shows the block diagram of reversible level 11. In a true SCRL implemen- 
tation, the pins carrying the result out of the SCRL-1 chip would also act as inputs to the 
reverse pipeline at some time during the cycle. This is because each line in SCRL is really 
both an output for one direction of the pipeline and an input for the other. Adopting this in 



Figure 5.10: Schematic diagram of reversible level 11 module. 

SCRL-1 would have meant that the circuit connected to the output pins of SCRL-1 should 
also be of the SCRL type, so as to drive the reverse pipeline when needed. Since SCRL-1 
is the only SCRL chip around, we needed the output pins of SCRL-1 to be true output 
lines with no need to drive them from the outside world to preserve the functionality of the 
internal reverse pipeline. With no signal driving the reverse direction, the pipeline carrying 
rOut<15:0> is now terminated with a break in reversibility. This is the same problem 
that we had when terminating the pipelines of A and B and to that end we use the same 
pipelineStop modules we had used before to reduce the effect of this break in reversibility. 

There is one difference from the previous case here however. The lines carrying rOut<15:0> 
have a drive of 16 at this point. This leads to 16 times more dissipation than the case with 
A and B whenever the prediction in the latches disagrees with the current value of the data. 
For this reason, we postpone the use of pipelineStop modules to reversible level 12 while 
using this level to reversibly reduce the drive of these signals to minimum according to the 
technique described in Section 4.8.2. This reduction is done in the revBuffer16_1 modules 
shown in the diagram. 

5.3.8 Reversible Level 12 

As mentioned in the previous section, this level contains 16 pipelineStop modules to 
reduce the dissipation of breaking reversibility at the end of the result pipeline. 

5.3.9 Rail Connections 

Figure 5.11 shows the connections of the control rails in SCRL-1. The module revMult8 
contains within it the 12 reversible levels described above. Up to this point, we have the 
ability to clock SCRL-1 in accordance with any of the clocking variants that were described 
in Section 4. However, this would require routing all the control rails of the revMult8 
module in Figure 5.11 to pins on the chip package. Restricting the ability to clock the chip 
to one clocking variant considerably reduces the needed pins. For this reason, I decided 
that SCRL-1 would only accept the 4-phase clocking scheme described earlier. In this 
scheme, the rails controlling level N in the pipeline have the same timing as the rails in 
level N + 4 and could therefore be connected to them. Examining Figure 5.11 we see 
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Figure 5.11: Schematic diagram illustrating the connection of the rails in SCRL-1. 

how such connections reduce the number of pins that are needed for the control rails. 
In the diagram, the signal enableAB is the enable signal that controls the latches in the 
pipelineStop modules of A and B while enableR controls the latches for rOut lines. 
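
The rail sharing implied by the 4-phase scheme can be illustrated with a few lines of Python. The sketch assumes only what the text states: level N shares its rail timing with level N + 4, and the chip has 12 reversible levels. The function name is hypothetical and the sketch does not attempt to reproduce the actual pin count.

```python
def rail_groups(levels=12, phases=4):
    """Group pipeline levels whose control rails have identical timing."""
    groups = {}
    for level in range(1, levels + 1):
        groups.setdefault((level - 1) % phases, []).append(level)
    return list(groups.values())
```

For the 12 levels of SCRL-1 this gives four groups, levels (1, 5, 9), (2, 6, 10), (3, 7, 11) and (4, 8, 12), so four sets of rails need to reach the package instead of twelve.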

5.3.10 I/O Connections 

Figure 5.12 shows the input/output connections of SCRL-1. This diagram is at the 
highest level of the schematic hierarchy of SCRL-1. The internal details of module rails1 
in the diagram are the contents of Figure 5.11. In this diagram we see the output pad driver 
module, outputPad. We also see the inputPad modules, which contain the ESD protection 
for the inputs to the chip. If we add the I/O lines of the rails1 module we get a count 
of 82. Adding to that the separate rails that drive the output pads, padsSlowRail<0:1>, 
and the implicit Vdd and GND lines to bias the P-wells and N-wells of the chip, we get 86. 
Unfortunately the MOSIS package provided for the size of the die we are using had only 
84 pins. To fit within this package, I eliminated the most significant bits from the A and 
B operand inputs to SCRL-1 and internally shorted aIn<6> to aIn<7>, and bIn<6> to 
bIn<7>. While SCRL-1 is internally an 8 × 8 multiplier, externally we can only feed 7 bits 
of each operand. 
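
The pin arithmetic and the effect of shorting bit 6 to bit 7 can be written out as a short Python sketch. The counts are taken from the text; the function and constant names are hypothetical.

```python
PACKAGE_PINS = 84  # pins available in the package, as stated in the text


def pins_required(rail_io=82, pad_rails=2, bias_lines=2):
    """I/O lines of the rails module plus the pad rails plus Vdd and GND."""
    return rail_io + pad_rails + bias_lines


def widen_operand(x7):
    """Internal 8-bit operand seen by the multiplier for a 7-bit external input."""
    x7 &= 0x7F
    msb = (x7 >> 6) & 1          # external bit 6 also drives internal bit 7
    return x7 | (msb << 7)
```

Dropping the two most significant operand pins brings the requirement from 86 down to the 84 available, at the price that internal bit 7 always mirrors bit 6; for example, an external 0x40 appears internally as 0xC0.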
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Figure 5.12: Schematic diagram illustrating the I/O connections in SCRL-1. 

5.4 Input Pads 

SCRL-1 contains a number of input pins with differing requirements. Different input 
pads are used by fast and slow rails, by pass rails, and by operand inputs, each with its own 
ESD protection requirements. The trade-off in SCRL is between maintaining 
adequate ESD protection and minimizing the energy dissipation caused by large diffusion 
resistors. 
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5.4.1 Operand Input Pads 

These are the pads used to input A and B. They only drive the gates of some devices in 
reversible level 1. Since very little current is driven through these lines (indeed the current 
could be zero for static input patterns), a relatively large polysilicon resistor is included in 
the path. This is followed by 2 reverse-connected ESD protection diodes, and finally the 
signal drives the gates of two large complementary transistors with the source and drain 
of each shorted together and connected to the respective Vdd or GND. The function of 
these devices is to position a large capacitance at the end of the polysilicon resistor, thus 
increasing the RC time constant of the input path and allowing the diffusion diodes time to 
conduct. 

5.4.2 Slow and Fast Rails Input Pads 

Both the slow and the fast rails use the same pad design. Note here that the slow 
and fast rails are the rails through which all the internal nodes of the computing circuits 
are charged and discharged. Adding a resistor in their path directly affects the energy 
consumption of the chip. Fortunately, these rails are connected to the sources and drains of 
a large number of devices, which results in a large RC time constant. For these 
reasons, the input pads for these rails contain only ESD diffusion diodes, and a direct metal 
path exists between the bonding pad and the rails' distribution networks. 

5.4.3 Pass Rails Input Pads 

We remind the reader here that the pass rails are used solely to control the states of the 
pass gates at the output of every SCRL gate in the circuit. Since this means that a pass 
rail directly drives the gates of a number of devices, we have to be more careful in our ESD 
protection to prevent damage. For this reason, the pass rails' input pads include a ~200 Ω 
polysilicon resistor to slow down transients enough for the reverse biased protection diodes 
to conduct. 

5.5 Output Pads 

Figure 5.13 shows the schematic diagram of the output pads used in SCRL-1. Each pad consists 
of two devices large enough to drive the pin and load capacitance of the output. As in 
the rest of SCRL-1, these devices are controlled by two swinging control rails and are not 
connected to Vdd or GND. Following those devices are two back biased devices to provide 
ESD protection. As in conventional CMOS, little added protection is needed in true output 
pads because the relatively large size of the driver itself provides adequate protection. The 
back biased devices were added for the output pads in SCRL-1 to provide a conduction 
path to Vdd or GND. This is because the devices in an SCRL pad driver have no connection 
to either Vdd or GND. 
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Figure 5.13: Schematic diagram of SCRL-1 output pad circuit. 

5.6 Layout of SCRL Circuits 

Figure 5.14 shows the typical layout for an SCRL chip. Similar to laying out conven- 
tional chips, the layout of SCRL chips starts with the design of the common cells. Unlike 
conventional CMOS however, the cell library in SCRL must contain additional cells that 
compute the inverse functions of the modules in the cell library. Figure 5.14 shows how each 
forward computing cell is paired with its inverse cell which is part of the reverse pipeline. 
Functionally, each forward cell is always paired up with the same inverse cell. Furthermore, 
the outputs and inputs of each forward cell always go to the corresponding inputs and 
outputs of the reverse cell. Because of these two facts, the layout synthesis of the cells in 
SCRL-1 was performed on design schematics that already contained the circuits for both 
the forward and the reverse sub cells. This increases the room for layout optimization with- 
out affecting the generality of the library since no sub cell is ever used without its inverse. 
It also cuts down on the wiring because the I/O connections between the forward and the 
reverse cells are optimized internal to the encompassing cell. We define the library cells 
that contain both the forward and reverse circuit components as reversible cells. 

Reversible cells that run horizontally next to each other and that share the same forward 
and reverse control rails are called a reversible level, e.g., level 1. Figure 5.14 shows a sketch 
of the layout for a six level SCRL pipeline. In the figure, each level contains three separate 
reversible cells. A 2-stage SCRL reversible cell needs 6 forward control rails to control the 
forward part of the reversible cell and 6 reverse control rails to control the reverse part. The 
cell also needs Vdd and GND to bias the substrate and the wells in its circuits. Since these 
signals are common to all of the reversible cells on the same level, the distribution of these 
rails occurs along channels that are parallel with each reversible level. For example, R_Rails 
1, which refers to the 6 control rails that are needed by the reverse cells of level 1, are 
channeled horizontally and immediately following the first reversible level of the pipeline, 
level 1. The signals in F_Rails 1, which refers to the 6 control rails that are needed by 
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Figure 5.14: Sketch of a typical layout of an SCRL chip. 
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the forward cells of level 1, are channeled horizontally and immediately preceding, and in 
parallel with, the first reversible level of the pipeline, level 1. By making sure that all the 
reversible cells in the library have the same height we can always produce circuits with 
highly optimized and regular layouts. 

The layout in Figure 5.14 is that of a 4-phase SCRL circuit. As such, the rails controlling 
the cells in level a are the same as the ones controlling the cells in levels (a ± 4) and hence 
should be joined. This is accomplished by the rail channels that run vertically in the layout. 
In SCRL-1, the vertical channels for the forward rails are on the left side while the channels 
for the reverse rails are on the right. The choice is arbitrary. 

Finally, the sketch in Figure 5.14 simplistically shows that the I/O lines from each 
reversible cell go to the cells that are directly above or directly below it. This is not 
the case in general. Outputs from a reversible cell could go to any reversible cell in the 
subsequent level of the pipeline. These I/O signals are generally cross-routed 
alongside the horizontal channels of the control rails. 

5.7 Design Entry and Verification 

The chip was designed using Cadence custom layout tools. The circuits were entered 
in schematic form and were verified with Verilog simulation files. Each node in the circuit 
was treated as a trireg node in the Verilog model in order to detect charge sharing. After 
verification, the cells' layouts were synthesized using Cadence synthesis tools. The resultant 
layout was then optimized by hand. Upon completion, the layout was compared to the 
verified schematic, including verification of device sizing. The verification included the 
circuits all the way to the bonding pads of the chip in order to detect all the wiring errors 
and to guard against incorrectly biased protection diodes. An HSpice model was produced 
for some of the key cells in the design and simulated to verify functionality when parasitics 
are included. 

The entire design resides in the scrl cadence library in the ~younis/cadence directory 
at the AI Lab. The chip was fabricated using a 2 micron P-Well process by ORBIT through 
MOSIS. 

6. Test and Measurement Results 



6.1 Introduction 

Testing of SCRL-1 was to accomplish two objectives. The first objective was to verify 
that SCRL-1 was indeed a working 8x8 reversible logic multiplier. The second was to show that 
circuits that are built using SCRL techniques consume considerably less energy than their 
conventional CMOS counterparts. Specifically, the energy consumption measurements were 
to show the asymptotically zero energy consumption behavior of SCRL circuits at lower 
operating frequency. 

For the first objective, I was able to verify the complete functionality of SCRL-1 as 
the design intended, including exhaustive testing of all I/O patterns. The second objective 
was more difficult to accomplish however. Power measurement of low power circuits is 
a tricky process. Energy measurement of low power circuits, as it turned out, is even 
more so. While I was able to show that the energy consumption of SCRL circuits is well 
below that of conventional CMOS, I was unable to measure exactly how much SCRL-1 was 
actually consuming. This is because the actual magnitude of this dissipation was buried 
below the dynamic range of the instrument used, and any attempt to get meaningful data 
was swamped by roundoff errors. Not knowing the exact values of dissipation at various 
frequencies made it impossible to verify the predicted asymptotic behavior of SCRL circuits. 

In this section I will outline the testing procedures that I used on SCRL-1, and will 
conclude with suggestions on how further testing could proceed. 

6.2 Digital Functional Testing 

In this phase of testing no attention was paid to the energy requirements and hence all 
signals supplied to the chip under test were either 0 or 5 Volts. To achieve this objective, 
I built a PC interface with a number of output and input lines. Each of the output lines 
could be independently set under software control. All the lines in the link connecting the 
PC side to the test fixture side were differential signal lines which therefore allowed the test 
fixture to float with respect to the power and ground lines of the PC. This was done to 
eliminate ground loops as well as to isolate the test fixture from the PC noise. C programs 
running on the PC sequenced the input lines of the SCRL-1 under test and read back the 
produced outputs for comparison. 

6.2.1 Testing of Forward Pipeline 

To test the function of the forward pipeline, the test fixture was set up as follows. First, 
all the slow and all the fast rails were connected to either Vdd or GND depending on their 
polarity. This in effect makes all the SCRL circuits in the chip identical to conventional 
CMOS. Secondly, each of the pass rails controlling the pass gates in the reverse pipeline 
was connected to either Vdd or GND so as to turn all the reverse pass gates off. Having 
done this, SCRL-1 is now reduced to a pipelined conventional CMOS multiplier with the 
forward pass gates serving as pipeline registers. 

The first test was to turn all the forward pass gates on so that the whole chip is in effect 
one combinational level deep and any change in the inputs would show up at the outputs. 
Then all the possible combinations of operands were introduced and the resulting outputs 
sampled. The chip produced all the correct results. 

The second test was to verify the function of the pass gates in the forward direction. 
In this test, the states of the forward pass gates were sequenced in the same order as that 
under normal SCRL operation. After the predicted number of clock cycles, the correct 
outputs were observed. This was again done for an exhaustive set of input combinations. 

The third test was to verify the function of the latches that are used in the pipelineStop 
modules of SCRL-1. To test these latches, SCRL-1 was sequenced by controlling the pass 
gates in the forward direction until a piece of data arrived at the latches' inputs. The data 
was then latched. Then all the previous stages of SCRL-1 were disabled and the outputs 
from the latches were enabled by enabling the reverse pass gates at their outputs. At this 
point, the content of the latches provided the only source to drive the node from which the 
data had been sampled. Now, the forward pipeline was allowed to proceed from that point. 
If the latches worked correctly, the values at the outputs should not give any new results. 
If however, the value from the latches differed from the sampled data, the outputs would 
be corrupted by the insertion of inconsistent values from the latches in the middle of the 
forward pipeline. The latches of pipelineStop modules locking the values of A, B and R 
all passed this test under all input combinations. 

6.3 Testing of Reverse Pipeline 

Even though SCRL-1 is internally a fully reversible implementation, externally both the 
operand input pins and the result output pins are one-directional. Therefore, testing the 
reverse pipeline by injecting the result at the outputs and watching for the state of the inputs 
is not possible. For this reason, the testing of the reverse pipeline had to be done indirectly. 
In this test, the reverse pipeline was at first fully disabled and the forward pipeline was 
allowed to proceed for a few cycles to set up some values on the nodes throughout the 
pipeline. Next, the forward pipeline was disabled and the reverse was allowed to proceed 
for a few cycles in effect back tracking the computation. Finally the direction was again 
reversed and the forward pipeline was allowed to proceed all the way to the outputs. In 
the event that the reverse pipeline was working properly, back tracking the computation 
for a number of cycles should not affect the final result. SCRL-1 passed this test under all 
input combinations. 

Having passed all of the above tests, it was evident that all the components of this silicon 
version of SCRL-1 were, from a digital point of view, working correctly. Even though in 
retrospect one does radiate confidence when asked about what he felt the chances were, I 
have to admit that seeing reversible logic predict the right value in silicon for the first time 
was heartening. 

6.4 Split-Level Operational Testing 

To test the chip for SCRL operation, one must sequence the slow and fast rails between 
their SET and RESET states in addition to sequencing the Pass rails. However, the slow 
and fast rails have Vdd/2 as their RESET level and therefore could not be driven by 
simple logic devices. Furthermore, to test for the linear relationship between the energy 
consumption and the clock frequency, the rise and fall times for all the rails had to be 
controlled. In essence, the testing circuit must achieve the following: it must provide the 
correct voltage level to all the rails that are not moving, and swing those that are moving 
in a controlled and timely manner. To this end a number of test circuits were designed and 
tested. The first was based on a Direct-Digital-Synthesis approach to generate the swinging 
signal. It was eventually abandoned in favor of a much simpler test fixture based on RC 
controlled square waves. What follows is a description of these circuits. 

6.4.1 Direct Digital Synthesis Testing Approach 

This was the first circuit built to test the SCRL operation of SCRL-1. With Direct 
Digital Synthesis (DDS), a waveform is directly generated by continuously feeding a 
Digital-to-Analog Converter (DAC) a stream of numbers representing the voltage values for the 
desired waveform. The test fixture based on DDS consisted of an array of analog switches 
in conjunction with two very high speed digital-to-analog converter modules (DACs) to 
generate the controlled voltage ramps for every control rail in SCRL-1. 

During SCRL operation, each rail goes through 4 different phases of operation. In the 
first phase, the rail is taken from the RESET voltage level of the rail to the SET voltage 
level. During the second phase, the rail is held at its SET level while other rails swing. In 
the third phase, the rail is taken back from its SET level to its RESET level. Finally, during 
the fourth phase, the rail is held at its RESET voltage level. During the phases where the 
rail is to be held at a certain voltage level, that rail cannot be left floating. This is because 
capacitive coupling from neighboring rails can shift the voltage on the rail and hence lead 
to dissipation later on. 

Conforming to the above, our first test fixture contained a separate 4 way analog switch 
for each rail. Two of the switch positions were connected to the RESET and SET voltages 
of that specific rail. The other two positions of the switches corresponded to the SETting 
and RESETting phases of operation and were connected to a ramp generator. To SET a 
rail, the position of its analog switch would be set to connect the rail to the ramp generator. 
The generator is then triggered to produce a gradual and controlled voltage ramp that is 
fed to the rail through the switch. Once the ramp reaches its final voltage value, the analog 
switch of the rail is then thrown to the position corresponding to the SET voltage level of 
that rail. This helps keep the rail voltage from wandering. A procedure symmetric to the 
one used for SETting is employed during RESETting. 

The analog switch array consisted of 24 74HC4052 analog switches. They were all biased 
with Vee = -5 Volts to give a relatively constant feed through resistance throughout the 
entire range of the rail swings of 0 to +5 Volts. The sequencing of these switches was done 
by a finite state machine that can be programmed from a PC. 

Figure 6.1: Block diagram of the high speed DAC module. 

SCRL operation depends on the gradual and controlled rise and fall times of the rails. 
For testing purposes, we must have the ability to vary these rise and fall times. In addition, 
our ramp generator must be able to operate at frequencies that are high enough so that 
the dissipation of the circuit does not become limited by leakage currents. To achieve the 
above, a pair of high speed DAC modules was designed and constructed. Figure 6.1 shows 
the block diagram of one of these modules. The first component of the module is a 50MHz 
DAC made by Qualcomm Inc. (Q2510). It was used because of its impressive speed and low 
noise coupling from the transitions of the digital inputs. In our design, we clocked this DAC 
at 32MHz. The Q2510 is a current mode DAC and the output was fed into a 50Ω resistor 
to convert the current signal to a voltage signal. This 50Ω resistor also became the source 
impedance of the DAC. In addition to the desired signal, the DAC output has a strong frequency 
component corresponding to its clocking frequency, the fundamental. This was suppressed 
by including a 7-pole elliptic filter at the output of the DAC. To eliminate reflection and 
maximize power transfer, the input and output impedances of this filter were designed to 
be 50Ω. From this we see that the effective impedance as seen by the output pin of the 
DAC is now 25Ω. Since the Q2510 has a full scale current of 20mA, we see that the full 
signal swing out of the filter is about 0.5V. For this reason, an operational amplifier was 
added and configured as a ×10 inverting amplifier. The amplifier selected for the task was 
a National Semiconductor LH0032 operational amplifier with a slew rate of 500V/µs. The 
amplifier was followed by a 250mA current buffer so that the source impedance of the whole 
DAC module is low enough to drive the load of any rail. 
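
The signal levels quoted above can be checked with a short worked calculation. This is only a sketch: it assumes that the 50Ω load resistor and the 50Ω filter input impedance simply appear in parallel at the DAC output pin, and that the amplifier gain magnitude is exactly 10. The symbols $R_{load}$, $R_{filter}$, $R_{pin}$, $V_{fs}$ and $V_{amp}$ are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
R_{pin} &= R_{load} \parallel R_{filter}
         = \frac{50\,\Omega \cdot 50\,\Omega}{50\,\Omega + 50\,\Omega} = 25\,\Omega \\
V_{fs}  &= I_{fs}\, R_{pin} = 20\,\mathrm{mA} \times 25\,\Omega = 0.5\,\mathrm{V} \\
|V_{amp}| &= 10 \times V_{fs} = 5\,\mathrm{V}
\end{align*}
```

Under these assumptions the amplified swing has a magnitude of about 5V, which is the order of swing the control rails require.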

The 10-bit data words were read in from a cyclical SRAM data buffer that was previously 
loaded from the PC. The data buffers had space for four different waveform patterns 
corresponding to the SETting and RESETting of both the slow and fast rails as well as 
the SETting and RESETting of the Pass rails. At this point, the PC is able to program 
the sequence of states of the analog switches as well as to download the contents of the 
cyclical SRAM buffer. This in effect provided us with a very flexible and programmable 
fixture that is able to control the level and the rate of change of the voltage on each of the 
control rails of SCRL-1. The finite state machine controlling the states of the switches had 
extra bits to select which of the four waveform buffers to use. It also had enough bits to 
configure the data inputs to the chip under test. 

After some debugging, I managed to get the test circuit described above to work as 
designed. There were a number of problems with this approach, however. The first and 
foremost was noise. To generate voltage ramps with the desired rise times, the DAC had 
to run at ~30MHz. This meant that in addition to the DAC's, all the components of the 
cyclical SRAM buffers with their counters and registers had to be clocked at the same rate. 
The toggling of all of these digital nodes at such a frequency injected some noise into the 
ground plane of the analog part of the circuit. For general use, that noise would not have 
been a problem. For SCRL however, the noise could lead to false readings. Recall that in 
SCRL, the energy consumption is linearly related to the slope of the signals at the control 
rails. Slope here refers to the instantaneous slope and not the average slope. The fact that 
the average slope is small is not enough to reduce the energy dissipation of the circuit if 
the signal had a high frequency component riding on the average transition. In this case, 
the slope affecting the dissipation is really the slope of the high frequency component even 
though it could be much smaller in magnitude than the slow moving average component. 
Multiple attempts to suppress this noise did not significantly improve the purity of the slow 
moving signal. 

The second problem involves the harmonics that are generated by the DAC. Even though 
the elliptic filter correctly removed the fundamental sampling frequency as well as all the 
harmonics above it, it could not remove the subharmonics that Direct-Digital-Synthesis 
(DDS) produces. Insofar as the generated frequency out of the DAC is relatively close 
to the cutoff frequency of the DAC, we could expect very nice smooth waveforms from our 
module. At lower frequencies however, the folding of the spectrum results in a staircase 
shape of the waveform as the time between samples of different values becomes large. As 
we have shown in Section 2, the dissipation of an SCRL circuit that is driven by a staircase 
waveform is inversely related to the number of steps in the waveform irrespective of the 
average rise time of the full swing. From this we see that once the generated waveform is 
slow enough for the staircase effect to appear, slowing down the rise time does not affect 
dissipation and therefore could yield incorrect test results. One way to fix this is to use 
a switched capacitor filter with a variable cutoff frequency in place of the fixed frequency 
elliptic filter. By always changing the cutoff frequency of the filter so that it was always 
just slightly above the average frequency of the rail, we could greatly reduce the staircase 
effect in the generated ramp. The problem is that it was difficult to find programmable 
filters with a pass band at the high frequencies of operation that we desire. 
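
The staircase behavior described above can be illustrated numerically. The following is a minimal sketch, not the author's analysis: it assumes an ideal capacitor charged through a series resistance in N equal voltage steps, with each step allowed to settle completely before the next one is applied. The function name and the example values (100 pF, 2.5 V) are hypothetical.

```python
def staircase_dissipation(c_farads, v_swing, n_steps):
    """Energy (joules) lost in the series resistance when a capacitor is
    charged through n_steps equal voltage steps that each settle fully.

    Assumption: every step settles completely, so the loss per step is
    C * dV**2 / 2 and does not depend on the resistance or on how long
    the whole swing takes.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    step = v_swing / n_steps
    return n_steps * 0.5 * c_farads * step ** 2


if __name__ == "__main__":
    # Hypothetical rail: 100 pF swinging by 2.5 V.
    for n in (1, 2, 10, 100):
        print(n, staircase_dissipation(100e-12, 2.5, n))
```

In this idealized model the loss falls as 1/N and contains no time term, which is why stretching an already stepped waveform in time would not be expected to lower the measured dissipation.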

Finally, the above test circuit clocked at 30MHz could not generate signals at frequencies 
higher than ~1MHz with any accuracy. This is because we need a minimum number of 
points from a DAC for each swing if we are to produce that swing with sufficient control 
on the rise and fall times of the generated ramp. 

For the above reasons, this first test fixture, with its approach that relied on DDS, 
was eventually abandoned. 

Figure 6.2: Block diagram of the components of the RC-controlled voltage step testing 
approach. 



6.4.2 RC-Controlled Voltage Step Testing 

In this method, the waveforms were generated by passing a voltage step through a resistor 
that would, with the aid of the equivalent capacitance of the rail, slow down the rise/fall 
time to a predetermined rate. Figure 6.2 illustrates the components of this approach. As 
seen from the chip boundary, each rail could be modeled by an equivalent capacitance in 
series with an equivalent resistance. The effective time constant of the rail, $\tau_{eq}$, is then 
$R_{eq} C_{eq}$. Adding an external resistance in series with the rail, having a much larger value 
than the internal equivalent resistance of the rail, forces the time constant of the new circuit, 
$\tau_{ext}$, to approximately $R_{ext} C_{eq}$. Since now $\tau_{ext} \gg \tau_{eq}$, the rise time of the rail becomes governed 
by $R_{ext}$ and could be adjusted to any value greater than $\tau_{eq}$. As illustrated in Figure 6.2, 
an analog switch is added to toggle the rail voltage between the SET and RESET levels. 
The analog switch is controlled by digital signals. In our test fixture, the above circuit 
is duplicated for every independently swinging rail. In total there are 48 separate rails 
and hence 48 separate switches. For every rail that swings in the positive direction and 
controls the P-Channel devices of the circuit, there is a rail that simultaneously swings in 
the negative direction and controls the N-Channel devices. Since both swing at the same 
time, the analog switches associated with these rails share a control line. This reduces the 
number of control bits to 24. To produce the correct sequencing patterns on these 24 lines I 
used a simple microcontroller with 3 external 8-bit decoded registers. The microcontroller 
was a PIC 16C54 from Microchip Technology in an 18-pin CERDIP package with UV erase 
window. The registers were CMOS 74HC374's to produce full rail-to-rail output drive. 
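
The relationship between the external resistor and the rail rise time can be sketched as follows. This is an illustrative model only: it treats the rail as a single-pole RC load, uses the standard 10%-90% rise time of such a load (tau times ln 9), and the function names and example values are hypothetical rather than taken from the test fixture.

```python
import math


def time_constant(r_ext_ohms, c_eq_farads, r_eq_ohms=0.0):
    """Time constant of the rail driven through an external series resistor.

    With r_ext much larger than r_eq this is approximately r_ext * c_eq.
    """
    return (r_ext_ohms + r_eq_ohms) * c_eq_farads


def rise_time_10_90(tau_seconds):
    """10%-90% rise time of a single-pole RC step response."""
    return tau_seconds * math.log(9.0)


def r_ext_for_rise_time(t_rise_seconds, c_eq_farads, r_eq_ohms=0.0):
    """External resistance that gives a target 10%-90% rise time."""
    return t_rise_seconds / (math.log(9.0) * c_eq_farads) - r_eq_ohms


if __name__ == "__main__":
    # Hypothetical rail: 100 pF equivalent capacitance, 1 kOhm external resistor.
    tau = time_constant(1000.0, 100e-12)
    print("tau =", tau, "s; 10-90% rise time =", rise_time_10_90(tau), "s")
```

Because the rise time scales linearly with the external resistance in this model, changing one resistor per rail is enough to sweep the rise time over a wide range.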

To guard against ground loops, all the voltage supplies for the test fixture were floating 
supplies with no connections to earth. In addition, the 5V needed to run the microcontroller 
and registers were made separate from the 5V supply that was used in SETting some of 
the rails. The circuit was constructed over a single, copper plate ground. Other needed 
voltages were +2.5V for the RESET level of the slow and fast Rails and -5.0V to sufficiently 
bias the analog switches. By reprogramming the microcontroller, alternate rail sequencing 
could be tested. 

The clock signal for the microcontroller came from a programmable waveform generator. 
To eliminate a potential ground loop, the clock from the generator was coupled to the test 
fixture through a differential line receiver. In addition to getting rid of the potential ground 
loop, the differential line receiver was sensitive enough to detect a 500mV peak-to-peak 
signal, and provided enough regeneration to accept a sinusoidal signal from the generator. 
With a low power pure sinusoid fed through differential lines to the test fixture, the noise 
from this high frequency signal was sharply reduced. 

In general, the microcontroller programs started by activating only the forward pipeline 
for a number of cycles sufficient to consistently set up the pipeline. Following that 
the reverse pipeline was activated thus starting true SCRL operation. Even though the 
functionality of the reverse pipeline was already tested logically, this test fixture provided 
the first opportunity to test the entire chip under true SCRL operating conditions including 
the splitting and restoration of the slow and fast Rails. The chip again passed this test. 

6.5 Energy Measurement Procedures 

Figure 6.3 shows the location of the scope probes that were used in measuring the 
energy consumption. The probes were connected to an HP5411D digitizing oscilloscope. 
The oscilloscope was connected to a PC using the GPIB interface and software drivers. 
A program on the PC side calculated the optimal oscilloscope settings, programmed the 
scope, and retrieved the data samples from the scope for analysis automatically. In our 
setup, the time during which a rail was moving was much smaller than the period of 
operation of that rail. With this in mind, having enough time resolution on the scope 
screen to accurately observe the rise time of a rail meant that a single period would span 
multiple scope windows. For this reason, the program running on the PC also calculated 
the number of windows that are spanned by a single period and would then control the 
scope to give multiple delayed snapshots so as to cover the entire period with sufficient 
time resolution. To reduce sampling noise, the scope was set up so that each reported data 
point was the result of averaging 10 different samples taken at different times but having 
the same relationship to the triggering event. In averaging mode, the HP5411D gave 501 
points for every window snapshot per channel. Therefore a measurement spanning multiple 
windows to cover the whole period could yield more than a few thousand data points. 

To compensate for probe offset, the PC would take one measurement trace while the 
rail voltage was held at 2.5V. It would then take the average of all of the 501 points for 
each channel and subtract 2.5V from it to get the probe offsets of each channel. Later on, 
all the readings from the scope were adjusted by these offsets before they were used in the 
calculations. 
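
The offset compensation just described amounts to a mean and a subtraction. The sketch below is illustrative only: the function names are hypothetical, and the 2.5V reference is the held rail voltage mentioned in the text.

```python
def probe_offset(samples, reference_volts=2.5):
    """Offset of one scope channel: mean reading minus the known rail voltage.

    `samples` is the list of readings (for example the 501 points of one
    trace) taken while the rail is held at `reference_volts`.
    """
    return sum(samples) / len(samples) - reference_volts


def remove_offset(samples, offset):
    """Subtract a previously measured channel offset from every reading."""
    return [s - offset for s in samples]


if __name__ == "__main__":
    calibration = [2.52, 2.48, 2.56]      # made-up readings of a 2.5 V rail
    offset = probe_offset(calibration)
    print(remove_offset([5.02, 2.52], offset))
```

Each channel needs its own offset, since the two probes are not guaranteed to share the same error.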

Knowing the value of $R_{ext}$ for the rail under measurement, we can calculate the power 
into the rail pin from the following equation. 

$$\frac{V_1 (V_2 - V_1)}{R_{ext}} = P \qquad (6.1)$$

This yields the instantaneous power consumed at a point in time. Averaging this value 
with the value of the previous data point and multiplying the result by the difference in 
time between the measurements yields the energy packet that was consumed by the rail 
during the interval between the consecutive data points. The above procedure is nothing 
more than the time integration of power over the interval using the trapezoidal rule to get 
the energy. Adding all these packets of energy from each time interval over exactly one 
period yields the total energy consumed by the rail during a period of operation. Here the 
RC-controlled voltage step approach provides more accuracy than the previous approach 
since the potential drop across $R_{ext}$ is always large, almost full Vdd at some times, since 
$R_{ext}$ is by design $\gg R_{eq}$. This is in contrast to the DAC approach where the rise time is 
controlled by the DAC and hence any resistor that is inserted to measure the current would 
yield a small voltage difference, probably in the 10mV range. 

Figure 6.3: Scope probe locations during energy dissipation measurement. 
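
The integration procedure described above can be sketched in a few lines. This is a hedged illustration rather than the program that ran on the PC: it assumes one probe on the driven side of the external resistor and one on the rail pin, and the names `v_source`, `v_rail` and `rail_energy` are hypothetical.

```python
def rail_energy(times, v_source, v_rail, r_ext):
    """Energy (joules) delivered into the rail pin over the sampled interval.

    times    : sample times in seconds, increasing
    v_source : voltage on the driven side of the external resistor (assumed probe)
    v_rail   : voltage at the rail pin (assumed probe)
    r_ext    : external series resistance in ohms
    """
    if not (len(times) == len(v_source) == len(v_rail)):
        raise ValueError("sample lists must have equal length")
    # Instantaneous power into the pin: rail voltage times current through r_ext.
    power = [vr * (vs - vr) / r_ext for vs, vr in zip(v_source, v_rail)]
    energy = 0.0
    for k in range(1, len(times)):
        # Trapezoidal rule: average of adjacent power samples times the time step.
        energy += 0.5 * (power[k] + power[k - 1]) * (times[k] - times[k - 1])
    return energy


if __name__ == "__main__":
    # Constant 1 W for 2 s gives 2 J.
    print(rail_energy([0.0, 1.0, 2.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0], 1.0))
```

Summing over exactly one period matters here, because energy that flows into the rail during SETting is largely returned during RESETting and only the small net remainder is the dissipation.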

To measure the energy consumption at a given rise time for the entire chip, the above 
procedure is repeated for every rail and their energies all added up. The reason for the direct 
measurement of energy is that we wanted to examine the energy that is dissipated inside the 
SCRL-1 only. Inserting an external inductor would have made our measurement subject 
to the Q of the inductor and we then would have had trouble separating the dissipation 
components. 

At one point it was thought that an operational amplifier connected so as to directly 
compute $(V_2 - V_1)$ would greatly increase the dynamic range of the measurement through 
the elimination of the common mode voltage. Unfortunately, after constructing the circuit, 
I found that the common mode rejection ratio of this and other high speed amplifiers 
deteriorated at high frequency to almost 0 dB. At a rejection ratio near 0 dB, the amplifier 
no longer distinguishes between common mode and differential voltage, making this approach 
useless. Therefore this approach was abandoned. 

6.6 Test Results 

As I have mentioned earlier, I was not able to directly verify the linear relationship 
between rise/fall time and energy dissipation that the theory predicts. Indirectly however, 
all other measurements seemed to confirm SCRL operation as predicted by both theory and 
SPICE simulations. This strengthens my conviction that the undetectability of the linear 
behavior was due more to insufficient accuracy in the measurement techniques than 
to an unforeseen flaw in the theory. 

6.6.1 The Problem of Direct Measurement of SCRL Energy 

As stated earlier, our test fixture measured the dissipated SCRL energy directly. This 
meant simultaneously and repeatedly sampling the voltage at the rail and the current into 
the rail for one full period of operation and then numerically integrating the calculated 
instantaneous power over the period to get the dissipated energy. It then became obvious 
that direct energy measurement of SCRL circuits is nearly impossible. To show this, let 
us track the energy flow for a fast rail that swings between 2.5V and 5V. When the rail is 
driven from 2.5V to 5V, the driving circuit delivers a certain amount of energy to the rail, 
which we call $E_{SET}$, where 

$$E_{SET} = E_{C_{eq}} + E_{R_{eq}} \qquad (6.2)$$

In the above equation, the first energy component is the amount of energy that is needed 
to charge the equivalent rail capacitance, $C_{eq}$, from 2.5V to 5V. The second is the energy 
that is lost in $R_{eq}$ as the result of charging $C_{eq}$ through $R_{eq}$. When the rail is driven back 
to 2.5V, the energy delivered to the rail, $E_{RESET}$, is equal to 

$$E_{RESET} = -E_{C_{eq}} + E_{R_{eq}} \qquad (6.3)$$

Hence the total energy consumed is 

$$E_{SET} + E_{RESET} = 2E_{R_{eq}} \qquad (6.4)$$

The problem with measuring this directly is that $E_{C_{eq}}$ is very nearly equal to the energy 
consumed by a conventional CMOS circuit in each cycle. In contrast, $E_{R_{eq}}$ is predictably 
orders of magnitude smaller than $E_{C_{eq}}$ for all operating frequencies except very close to 
the maximum operating frequency of the SCRL-1 chip. Since every branch in SCRL-1 has 
been properly sized to reduce energy consumption, and since each rail drives a circuit that 
is only one logic level deep, the internal speed of each stage of SCRL-1 is somewhere below 
a nanosecond. This means that for all operating frequencies below 100MHz, $E_{R_{eq}}$ will 
literally be orders of magnitude below $E_{C_{eq}}$. The HP5411D only has 6 bits of resolution at 
our operating range. In addition, each rail voltage was slightly above half of the full scale 
deflection of the scope and therefore the resolution was effectively only 5 bits. With this 
resolution, anything below one part in 32 is buried under the quantization noise. Anything 
that is orders of magnitude below the maximum is obviously undetectable. For the above 
reason, we see that the way in which the energy flows in and out of a rail in SCRL renders 
direct measurement impossible. Please note that in order to get energy measurements 
spanning multiple decades of operating frequencies, we would need an instrument with 
a dynamic range of multiple decades. Therefore, even though a more accurate instrument 
would get us closer to the desired quantities, no instrument could give us multiple energy 
measurement points spanning multiple decades of operating frequency, which is what is 
needed to verify the asymptotic behavior of SCRL circuits. For this reason, SCRL energy 
measurement must be done indirectly. 

Table 6.1: Energy measurements for rail FF1(+) in SCRL-1. 
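
The resolution argument above can be made concrete with a small calculation. This sketch assumes an ideal quantizer and ignores noise, averaging gain and probe error; the function names and the example ratios are hypothetical.

```python
import math


def smallest_resolvable_fraction(effective_bits):
    """One quantization step as a fraction of the full measured range."""
    return 1.0 / (2 ** effective_bits)


def bits_needed(ratio):
    """Smallest number of bits whose quantization step is no larger than `ratio`.

    `ratio` is the quantity of interest divided by the full measured range,
    for example 1e-3 for something three orders of magnitude below full scale.
    """
    return math.ceil(math.log2(1.0 / ratio))


if __name__ == "__main__":
    print(smallest_resolvable_fraction(5))   # one part in 32
    for ratio in (1e-2, 1e-3, 1e-4):
        print(ratio, bits_needed(ratio))
```

Even in this idealized picture, a quantity three orders of magnitude below the dominant term would need roughly twice the effective resolution that was available, before any noise is considered.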

Table 6.1 lists energy measurements for positive SETting Forward direction fast rail 
#1, FF1(+). To measure $E_{C_{eq}}$, the power integration was started immediately before 
the SETting of the rail, and the maximum energy value reached during the integration 
was recorded. This corresponds to the energy sent into the rail before the direction of 
power flow is reversed. $2E_{R_{eq}}$ was taken to be the value of the integration over the whole 
period. From the table we see that $E_{C_{eq}}$ is measurable and consistent but $2E_{R_{eq}}$ was not. 
This reflects the fact that it was buried under the dynamic range of the measurement. 
The measurements were also mostly negative, resulting from a negative bias from the offset 
voltage of the probes. In addition, the table shows that $E_{C_{eq}}$ is not a function of rise time, as 
we would expect. This is because $E_{C_{eq}}$ is equal to CMOS dissipation, which is not affected 
by rise time. 
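
The two quantities in the table can be read off a single running integral of power: its peak and its final value. The following is a minimal sketch of that bookkeeping under the assumption that the integration starts just before the rail is SET and covers exactly one period; the function name and the sample data are hypothetical.

```python
def peak_and_net_energy(times, power):
    """Return (peak, net) of the running trapezoidal integral of power.

    peak : largest cumulative energy reached, i.e. the energy sent into the
           rail before the direction of power flow reverses
    net  : cumulative energy at the last sample, i.e. the integral over the
           whole record
    """
    running = 0.0
    peak = 0.0
    for k in range(1, len(times)):
        running += 0.5 * (power[k] + power[k - 1]) * (times[k] - times[k - 1])
        peak = max(peak, running)
    return peak, running


if __name__ == "__main__":
    # Idealized lossless swing: energy goes in, then all of it comes back.
    t = [0.0, 1.0, 2.0, 3.0, 4.0]
    p = [0.0, 2.0, 0.0, -2.0, 0.0]
    print(peak_and_net_energy(t, p))
```

In the idealized example the net value is exactly zero; in a real measurement the net value is a small difference of two large terms, which is why it is so sensitive to offset and quantization error.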

6.6.2 Confirmation of SCRL Energy Flow 

Measurements on rails FF3(-) and RF4(-) are listed in Table 6.2. FF3(-) is the #3 
negative SETting fast rail in the Forward direction. RF4(-) is the #4 negative SETting 
fast rail in the Reverse direction. FF3(-) is the most loaded rail in SCRL-1. The interesting 
thing in the table is that $E_{period}$ is measurable with some consistency. It is also always 
negative for FF3(-), indicating that this rail is supplying energy to the outside world. The 
magnitude of $E_{period}$ is also large enough that these measurements are not due to noise. 

To explain these results, we recall the sequence of operations for an SCRL gate. In the 
forward direction, a gate grabs the output node while it is at 2.5V and sets it to the correct 
logic level. It then disconnects itself from the node so that it can restore itself without 
affecting the value on that node. The gate driving that node in the reverse direction is 
responsible for restoring it to 2.5V. Therefore, fast rails in the forward direction always 
SET circuit nodes while fast rails in the reverse direction always RESET them. For negative 
swinging rails, SETting a node means removing energy from it so as to discharge it from 
2.5V to 0V. RESETting the node means delivering energy back to it to charge it back up 
to 2.5V. The opposite is true for positive swinging rails. In SCRL-1, FF3(-) is responsible 
for SETting the nodes following the third pipeline stage, while RF4(-) is responsible for 
RESETting them. With this in mind, energy extracted from SCRL-1 through FF3(-) must 
come from RF4(-), and Table 6.2 shows that it does. From the above, we see that $E_{R_{eq}}$ 
really corresponds to the difference in $E_{period}$ between the pair of rails that affect the same 
circuit nodes. Subtracting $E_{period}$ of FF3(-) from that of RF4(-) gives us unreliable data 
since once again the real difference is much less than either of the two quantities. The above 
energy flow provides confirmation of SCRL operation in the chip. 

Table 6.2: Energy measurements for rail FF3(-) and RF4(-) in SCRL-1. 

6.6.3 Confirmation of Node Hand-off in SCRL 

Measurements of energy for FF3(-) and RF4(-) confirmed the hand-off that occurs between 
the forward and reverse parts of the pipeline. However, the effect was not detectable 
for other rails. This is because the load on other rails is smaller than the load on FF3(-) 
and RF4(-) in SCRL-1, which resulted in burying $E_{period}$ below the dynamic range of the 
measurement. 

To confirm the effect of the hand-off between the forward and reverse gates, we examine 
the difference between the rise and fall times of these gates. During the rise time, FF1(+) 
charges both the internal nodes as well as the output node of all the gates it controls. During 
the fall time, FF1(+) discharges only the internal nodes of these gates. This means that 
the effective capacitance of FF1(+) during rise time is larger than its capacitance during 
fall time. Figure 6.4-a shows the rise time for rail FF1(+) in response to a square wave
input with a 2777.5 Ω external resistor. The rise time is approximately 500 ns. Figure 6.4-b
shows the fall time for FF1(+) in response to a square wave input with the same 2777.5 Ω
external resistor. As we can see, the fall time is much faster, approximately 300 ns,
confirming the prediction based on SCRL operation.
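
As a rough illustration of this comparison, the sketch below converts the two measured
transition times into effective rail capacitances. It assumes, which the text does not
state, that the rail behaves as a single-pole RC circuit, that the quoted times are
10%-90% transition times, and that the external resistor dominates the series resistance.
The function name is illustrative only.

```python
import math

R_EXT = 2777.5      # external series resistor, ohms (value quoted in the text)
T_RISE = 500e-9     # approximate measured rise time of FF1(+), seconds
T_FALL = 300e-9     # approximate measured fall time of FF1(+), seconds

def effective_capacitance(t_transition, r_series):
    """Capacitance of a single-pole RC circuit from its 10%-90% transition time.

    Assumption: t_transition = ln(9) * R * C (about 2.2 * R * C).
    """
    return t_transition / (math.log(9.0) * r_series)

c_rise = effective_capacitance(T_RISE, R_EXT)
c_fall = effective_capacitance(T_FALL, R_EXT)

print(f"C during rise: {c_rise * 1e12:.1f} pF")
print(f"C during fall: {c_fall * 1e12:.1f} pF")
print(f"ratio rise/fall: {c_rise / c_fall:.2f}")
```

Under this model the ratio of the two capacitances equals the ratio of the two transition
times, so it does not depend on the exact definition of rise time that was used.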

6.6.4 SCRL versus CMOS Operation

To illustrate the contribution of the reverse pipeline in reducing the energy consumption 
of CMOS circuits, I ran SCRL-1 with the reverse pipeline disabled. Under such conditions, 
all the reverse pass gates were disabled and all the reverse slow and fast rails were tied 
to 2.5V. Figure 6.5 illustrates the waveform trace of the FF1(+) rail under correct SCRL
conditions as well as with the reverse pipeline disabled, i.e., conventional CMOS conditions.
Figure 6.5-a shows FF1(+) under normal SCRL operation. Note that in this mode the
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Figure 6.4: (a) Scope trace of the rise time of FF1(+) with 2777.5 Ω resistor, (b) Scope
trace of the fall time of FF1(+) with 2777.5 Ω resistor.

transfer of charge to and from the rail is completely controlled by the rise/fall times of
the rail. No uncontrolled charge sharing or voltage resetting occurs and the waveform is
fairly smooth.

Figure 6.5-b shows FF1(+) with the reverse pipeline disabled, thus mimicking conventional
CMOS operation. For both traces, the forward gate controlled by FF1(+) is connected to
the output approximately 2.5 μs before the rail starts rising. Under SCRL conditions, the
reverse pipeline would have non-dissipatively RESET the value of that node to 2.5V. Hence
when the hand-off occurs under SCRL conditions, no charge moves and no glitch occurs. In
the absence of the non-dissipative restoring action of the reverse pipeline, there is no way
for the circuit to non-dissipatively restore the voltage on that node to 2.5V. Hence we
observe a large spike around the hand-off time, approximately 2.5 μs before FF1(+) starts
rising. There is also no way for FF1(+) to recover the energy in the spike, since it cannot
tell the voltage values on all the nodes that are causing it. Integrating the instantaneous
power in this spike over its duration, we get 0.2 nJ for FF1(+). This is on the same order
as the total energy delivered to $C_{eq}$ in FF1(+). It is also the amount that CMOS
dissipates every cycle. From the above we see how the introduction of the reverse
pipeline has helped eliminate the spike associated with increased information entropy.
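
The integration step described above can be sketched as follows. This is only an
illustration of integrating sampled instantaneous power with the trapezoidal rule; the
function name and the sample values are hypothetical and are not the scope data from
SCRL-1. The samples were chosen so that the result lands on the same order as the 0.2 nJ
quoted in the text.

```python
def pulse_energy(times, voltages, currents):
    """Integrate instantaneous power v(t) * i(t) over time (trapezoidal rule)."""
    if not (len(times) == len(voltages) == len(currents)):
        raise ValueError("sample lists must have equal length")
    energy = 0.0
    for k in range(1, len(times)):
        p_prev = voltages[k - 1] * currents[k - 1]
        p_curr = voltages[k] * currents[k]
        energy += 0.5 * (p_prev + p_curr) * (times[k] - times[k - 1])
    return energy

# Hypothetical samples: a triangular current spike at a constant 2.5 V,
# peaking at 1.6 mA and lasting 100 ns. These numbers are NOT measured data.
times = [0.0, 50e-9, 100e-9]
voltages = [2.5, 2.5, 2.5]
currents = [0.0, 1.6e-3, 0.0]

energy = pulse_energy(times, voltages, currents)
print(f"spike energy: {energy * 1e9:.2f} nJ")
```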
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Figure 6.5: (a) FF1(+) waveform under SCRL operation, (b) FF1(+) waveform with the 
reverse pipeline disabled. 
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Figure 6.6: Non-Inverting SCRL Gate. 



6.6.5 SCRL Capacitive Coupling 

Up to this point we have ignored a small voltage "bump" that occurs just before the
rail starts rising in all of the traces shown so far. The bump was also evident under SCRL
conditions. The presence of the bump was not a surprise as I had observed it in HSPICE
simulations. In addition, its negligible effect on dissipation should become evident as soon
as we recognize its cause. Figure 6.6 shows the diagram of a typical 2-level SCRL gate.
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Figure 6.7: CMOS capacitive coupling from swinging of FS1(+/-) at varying $R_{ext}$ on FF1(+).

We will assume that the output pass gate is always on. Starting with valid inputs and all
internal nodes and rails at 2.5V, we proceed to SET the first stage by splitting the fast rails.
If the input was at 0V, the output of the first stage would rise to 5V. While rising, this 
signal would capacitively couple through the gates of the second stage to the output of the 
gate thus moving the output voltage away from 2.5V. This is possible because at the time 
that the fast rail starts to SET, the devices in the second stage are all off and hence the 
output node of the gate is floating. As soon as the output from the first stage climbs a 
few tenths of a volt above 2.5V, the second stage devices start to conduct and the output 
voltage is anchored to 2.5V. This obviously leads to dissipation. Fortunately, the problem 
could be cured in a variety of ways. We can for example introduce a keeper transistor that 
will connect the output of the gate to 2.5V while the first stage swings. 
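
The size of such a bump can be estimated with a simple capacitive-divider model. The
sketch below is an illustration only: it assumes the floating output node sees one lumped
coupling capacitance to the first-stage output and one lumped capacitance to ground. All
names and capacitance values are hypothetical and were not extracted from SCRL-1.

```python
def coupled_bump(delta_v_in, c_couple, c_node):
    """Voltage step on a floating node coupled through c_couple.

    Capacitive divider: dV_out = dV_in * c_couple / (c_couple + c_node).
    """
    return delta_v_in * c_couple / (c_couple + c_node)

# Hypothetical values (not from the SCRL-1 layout).
DELTA_V_IN = 0.5          # first-stage swing before the second stage conducts, volts
C_COUPLE = 5e-15          # gate-to-output coupling capacitance, farads
C_NODE_LIGHT = 20e-15     # lightly loaded floating output node, farads
C_NODE_HEAVY = 2000e-15   # same node with a much larger capacitance attached, farads

bump_light = coupled_bump(DELTA_V_IN, C_COUPLE, C_NODE_LIGHT)
bump_heavy = coupled_bump(DELTA_V_IN, C_COUPLE, C_NODE_HEAVY)
print(f"bump on lightly loaded node: {bump_light * 1e3:.0f} mV")
print(f"bump on heavily loaded node: {bump_heavy * 1e3:.2f} mV")
```

In this model the bump shrinks as the capacitance of the floating node grows relative to
the coupling capacitance.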

Another way to reduce its effect is to note that the slower the rise time of the fast
rails is, relative to $R_{ext}$ of the slow rails, the smaller the size of the bump becomes.
Figure 6.7 shows multiple scope traces of FF1(+) during which the rise time of FS1(+) is
kept constant, at approximately 500 ns, while $R_{ext}$ of FF1(+) is increased. We point the
reader's attention to how the magnitude of the bump, located around -0.5 μs, increases with
increased $R_{ext}$. Also note the kink in the rise of FF1(+). This corresponds to the time
that the rise of FF1(+) starts to turn on the devices whose gates FF1(+) is driving. At that point the rail
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capacitance suddenly increases and the rail rise time momentarily dips due to inductance 
in the rail circuit. Soon after, the rail resumes its rise. Irrespective of the magnitude of the 
capacitive coupling, the magnitude of the voltage bump never exceeds the value at which 
the kink occurs since the sudden added capacitance of the node helps to significantly reduce 
the capacitive coupling to the fast rail. 

6.7 Summary 

As stated in this section, we have been able to verify the complete operation of the SCRL-1
chip. Furthermore, through a number of energy measurements we have been able to verify
the energy movement from the forward to the reverse rails as well as the predicted differ- 
ence in capacitive load for each rail under SCRL operation. We have also demonstrated the 
importance of the reverse pipeline by observing the increased dissipation in the form of an 
RC glitch when the reverse pipeline was disabled. Unfortunately, the experimental verifi- 
cation of the asymptotic behavior of SCRL continued to elude us. However, measurements 
of other quantities all agreed with that which is predicted under SCRL operation. 

One method that might succeed in measuring the energy dissipation of SCRL circuits 
would be to use an external inductor. By connecting an inductor to a rail, we effectively
have an RLC circuit. Through calibration, we could determine the amount of energy that 
is lost in the inductor at our operating frequency. Next we start the circuit oscillating 
and maintain it in that state by constantly replenishing the energy that is dissipated every 
cycle. By measuring this injected energy, and allowing for inductor dissipation, we should 
be able to determine the energy dissipation of the rail. We intend to do this in the near 
future. 
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7. Future Work 



To my knowledge, SCRL-1 is the first working implementation of a pipelined, reversible
computation based, asymptotically zero energy logic. As such, I am certain that there is 
a lot more for us to discover and refine than what we have reported so far. This section 
attempts to give some suggestions about where future research in this area could proceed. 
My opinion is that as far as energy-conscious reversible computation is concerned, we've 
only just begun... 

7.1 SCRL-1 Energy Measurement 

Even though we are quite pleased with the results we've gotten so far, one important 
aspect of SCRL operation has not yet been verified: the linear reduction in energy consump- 
tion with the reduction in operating frequency. In the preceding section, I have outlined a 
number of attempts that were targeted towards verifying this SCRL property. I have also 
demonstrated that such a measurement could not be performed directly. In the future, we 
plan to re-attempt this measurement. Two options are now under investigation. The first 
is to use calibrated LC oscillating tank circuits, while the second is based on calorimetric 
techniques. The advantages and disadvantages that are offered by both are currently under 
consideration. 
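
The property to be verified can be stated numerically. The sketch below compares the
standard first-order estimates for charging a capacitance C through a resistance R: about
(1/2)CV^2 when the supply is switched abruptly, and about (RC/T)CV^2 when the rail is
ramped over a time T much longer than RC. The component values and function names are
hypothetical and are not measurements from SCRL-1.

```python
def conventional_energy(c, v):
    """Energy dissipated when C is charged abruptly to V through a resistor."""
    return 0.5 * c * v * v

def ramp_energy(r, c, v, t_ramp):
    """First-order dissipation for a ramp of duration t_ramp >> R * C."""
    return (r * c / t_ramp) * c * v * v

# Hypothetical values, for illustration only.
R = 1e3        # series resistance, ohms
C = 50e-12     # load capacitance, farads
V = 5.0        # voltage swing, volts

for t_ramp in (1e-6, 10e-6, 100e-6):
    e_ramp = ramp_energy(R, C, V, t_ramp)
    print(f"T = {t_ramp * 1e6:6.0f} us: ramp {e_ramp:.3e} J, "
          f"abrupt {conventional_energy(C, V):.3e} J")
```

Each tenfold increase in the ramp time lowers the estimated dissipation per transition
tenfold, which is the linear dependence the measurement is meant to confirm.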

7.2 CAD Work 

The main reason behind the likely resistance to the adoption of SCRL is the requirement
of reversible computation. Requiring designers to design the components for both the 
forward as well as for the reverse pipelines of the circuit is cumbersome and wasteful. This 
is because in defining the design of the forward modules, one also uniquely defines the 
function of the reverse and inverse modules. Admittedly, the real problem for the synthesis 
of the inverse modules is that of existence rather than definition. For an inverse module to 
exist, the function of the forward module it corresponds to must be bijective. Unfortunately, 
most simple logic functions are not bijective. However, our experience is that non-bijective
functional modules could be embedded in larger functional blocks that on the whole are
bijective. For example, a NAND gate is not bijective, but a crossbar switch made out of a
number of NAND gates is. To this end a CAD system could be a big help. An SCRL CAD
system would take the specifications of only the forward modules as an input for synthesis.
It would then attempt to section the total function of the forward pipeline into a number
of modules, placing the sectioning lines at points where the resulting modules are bijective,
or at least are easily augmented to be so. Once that is accomplished, the rest is relatively
straightforward and currently available CAD tools could be called upon to complete the
design process. In summary, the most important contribution of CAD in the area of
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SCRL is in automating reversibility or at least advising the engineer upon the optimal place 
at which reversibility should be broken when it has to be. I believe work in this
area would prove to be both an interesting challenge as well as a rewarding opportunity.
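
The bijectivity test that such a CAD system would need can be sketched by brute-force
enumeration. This is a minimal illustration, not a proposed tool. The crossbar below is
modeled behaviorally rather than as a NAND netlist, and the choice to pass its select line
through to the output is an assumption made here so that the input and output widths match.

```python
from itertools import product

def is_bijective(func, n_inputs, n_outputs):
    """Return True if func maps the 2**n_inputs input tuples one-to-one
    onto the 2**n_outputs possible output tuples."""
    if n_inputs != n_outputs:
        return False  # widths must match for a Boolean function to be bijective
    images = {tuple(func(*bits)) for bits in product((0, 1), repeat=n_inputs)}
    return len(images) == 2 ** n_outputs

def nand(a, b):
    """Two-input NAND: two inputs, one output, so information is lost."""
    return (1 - (a & b),)

def crossbar(select, a, b):
    """Hypothetical 2x2 crossbar that also passes its select line through.

    select = 0: straight through; select = 1: outputs swapped.
    """
    return (select, b, a) if select else (select, a, b)

print(is_bijective(nand, 2, 1))      # False
print(is_bijective(crossbar, 3, 3))  # True
```

Exhaustive enumeration is only practical for small modules, which matches the idea of
sectioning the pipeline into small blocks before checking each one.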

7.3 Architecture Issues 

As is well known, there is usually more than one way to perform a computing function,
each being more suitable than the others under one or more constraints. Until recently, 
reversibility was not one of these constraints. In SCRL, reversibility is now the most 
important constraint. Architectures targeted for SCRL circuits should have reversibility
in mind at all the levels of the design. This should continue to be true even under the 
scenario of readily available and efficient SCRL CAD tools. The obvious reason is the fact 
that implementations that are impossible to make reversible at the circuit level can be easily 
made so on the architectural level. Well before the existence of any real implementations 
of reversible logic, in 1981 Ressler [36] addressed some of the architectural as well as some 
of the CAD issues in building a conservative logic computer. 

7.4 Power Supply Switch 

In Section 2, we have seen how the dissipative effects of the power MOSFET in the 
power supply could significantly reduce the energy saving performance of charge recovery 
circuits. Even though there might be applications where lower dissipation in the computing 
circuits alone is just as useful, in general, one has to include the dissipation in the power 
supply when comparing SCRL to conventional CMOS. It should be accepted by now that if 
the power switch was built out of the same CMOS technology as the computing circuit, the 
best we can optimize for is an energy consumption that is related to the yT . However, this 
limit does not apply in circuits where the computing part and the power part are built out 
of different technologies. This presents another research opportunity, where the marriage 
of SCRL built with CMOS devices and a power switch of alternate technology leads to 
substantial reduction in the energy consumption of the system. 

7.5 SCRL Circuit Improvements 

One of the potential problems of multi-level SCRL is the voltage bump that was described
in Section 6.6.5. A number of remedies were also outlined in that section. We
think it is important to pursue the elimination of this effect in SCRL circuits since it could
reduce the efficiency of the circuit.

Finally, a major disadvantage of SCRL compared to the earlier implementations of CRL 
is the larger number of required rails. In going from CRL to SCRL, we managed to reduce 
the number of devices, cut down the number of wires and simplify the circuits considerably. 
Unfortunately, the number of required rails rose just as considerably. Our drive to improve 
CRL came from a feeling that "simple circuits should not be that complex." In following 
this hunch I was fortunate enough to stumble upon SCRL. 



Faced now with the large number of required rails in SCRL, I'm thinking of trying this 
trick again, namely: 

"simple circuits should not be this complex." 
I have to admit though that I am bracing myself for a possible answer of... 

"Well... maybe they just are!" 
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A. CRL at Low Temperature 



It is generally known that operating CMOS systems at low temperatures improves a number 
of desirable device parameters. This appendix will attempt to show the higher compatibility 
between CRL and low temperature operation. 

A.1 CRL and Large Systems at LNT

Currently, a number of supercomputers operate at temperatures that are lower than 
room temperature. This trend we believe will continue because of the numerous additional 
device and system parameter improvements that happen at lower operating temperatures 
and which can no longer be ignored. In what follows we will enumerate these improvements 
that we believe will drive some designers to consider circuit operation at temperatures as 
low as 77K, the boiling point of liquid nitrogen at 1 atmosphere. We will then show how 
the combination of CRL techniques and low temperature operation produces unparalleled
performance by each uniquely offsetting the disadvantages of the other. Our discussion will
focus on Si MOSFET's, visiting GaAs based devices only briefly. More precisely we will
concentrate on enhancement mode MOSFET's which are the basis of CMOS technology.

A large body of work exists that describes the relevant properties of semiconductor
materials and devices at low temperature and a good selection of it has been collected 
in [24]. For silicon, the body of research suggests that most of the parameter improvements 
are attained at 77K. Operating at temperatures below 77K yields diminishing returns. For 
this reason, we will consider 77K as our low temperature operating point and compare the 
performance parameters there to those at room temperature. 

Material Parameters 

The first improvement in material parameters is the drop in electrical resistance with 
reduced operating temperature. Kirschman [25] reports that the resistance of materials that
are used in the interconnection traces of CMOS circuits drops considerably at 77K. He
reports that the conductivity increase is largest for Al interconnecting traces. In VLSI, 
lower interconnection resistance leads directly to faster operation. This is because in a 
dense planar circuit, interconnection RC is a major factor in determining operating speeds. 

In addition, operating at 77K holds the promise for off chip superconducting intercon- 
nections through using high temperature superconducting (HTS) compounds [29] for the 
traces in a multichip module. For CRL circuits, the energy saving ratio is limited by the Q 
of the inductor. But at 77K, we can use HTS material to build our inductors with Q's as 
high as 300,000. Initial investigations have shown that inductors with parameters suitable
to CRL circuits, inductance in the microhenries and a critical current of at least 5-10 Amps, 
are in existence today. 
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The second advantage is the increase in thermal conductivity of many of the materials
that are used in CMOS fabrication. Kirschman [25] reports that for "single crystal
semiconductors, Si, Ge, GaAs; single crystal dielectric materials, sapphire, quartz, diamond;
some polycrystalline materials, alumina and beryllia; and relatively pure metals, copper
and aluminum," the thermal conductivity increases by as much as an order of magnitude
by operating at lower temperatures. Thermal conductivity of p-type silicon is reported 
to be six times that at room temperature [6]. This is important since the power density 
of VLSI chips continues to increase as circuit density increases. On the other hand some 
materials currently employed in VLSI fabrication exhibit monotonically decreasing thermal 
conductivity with lower temperature such as polymers, glasses and metal alloys [19] [39]. 
However, the fact that most of these materials are used in packaging suggests that careful
material selection for packaging that is specific to low temperature operation could reduce 
the effect of these materials. 

The third is increased reliability. It is well known that failure rate is directly related 
to operating temperature. Little data exists that quantitatively compares the reliability of
devices at 77K to those at room temperature. However, we can use the Arrhenius relation to
predict the factor of reliability improvement due to lower temperature. The Arrhenius
relation, formulated to predict the rate of chemical reactions, is widely employed in accelerated
tests for device lot characterization [2]. It states that the rate at which a failure occurs, R,
is given by

$$R = R_0 e^{-E_a / kT}$$

where $R_0$ is a constant, $E_a$ is the activation energy of the predominant failure mechanism,
k is the Boltzmann constant, and T is the absolute temperature. Obviously, the activation
energy is an empirical parameter that depends highly on the fabrication process as well
as the operating conditions and is known with relative certainty only for mature processes
and technologies. Comparing the lifetime of devices at 77K to that at room temperature
we calculate the improvement factor,

$$F = \frac{R_{298}}{R_{77}} = e^{(-E_a/k)(1/298 - 1/77)}$$

From MIL-HDBK-217E, $E_a = 0.7$ eV for MOS devices and we get an $8.86 \times 10^{33}$ fold increase
in the expected device lifetimes, which is obviously overly optimistic. The error in our
estimate stems from the assumption that the failure mechanism that is dominant at room
temperature will continue to be the dominant one at 77K. The fact is that the relatively
high rate of the mechanism that is dominant at room temperature completely masks other
mechanisms that have lower activation energies and necessarily a much lower $R_0$
associated with their occurrence. As the temperature drops, the rate of occurrence of high
$E_a$ processes drops off considerably faster, orders of magnitude faster, than the rate of
processes with lower activation energy. It is not difficult to see how lower $E_a$ processes will
become the dominant failure mechanism at some low temperature.
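
The calculation and the masking argument above can be checked with a short sketch. It
evaluates the improvement factor for an activation energy of 0.7 eV using k = 8.617e-5 eV/K.
Because the exponent is large, the result is sensitive to the constants used and should
only be expected to agree with the quoted figure in order of magnitude. The second,
low-activation-energy mechanism and both prefactors are hypothetical values chosen purely
to illustrate the crossover.

```python
import math

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def arrhenius_rate(r0, e_a, temperature):
    """Failure rate R = R0 * exp(-Ea / (k * T)), with Ea in eV and T in kelvin."""
    return r0 * math.exp(-e_a / (K_BOLTZMANN_EV * temperature))

def improvement_factor(e_a, t_hot=298.0, t_cold=77.0):
    """F = R(t_hot) / R(t_cold); the prefactor R0 cancels."""
    return math.exp((-e_a / K_BOLTZMANN_EV) * (1.0 / t_hot - 1.0 / t_cold))

print(f"F for Ea = 0.7 eV: {improvement_factor(0.7):.2e}")

# Hypothetical pair of mechanisms: the low-Ea one has a far smaller prefactor,
# so it is masked at room temperature. These numbers are illustrative only.
HIGH = {"r0": 1.0, "e_a": 0.7}
LOW = {"r0": 1e-12, "e_a": 0.1}
for temperature in (298.0, 77.0):
    r_high = arrhenius_rate(HIGH["r0"], HIGH["e_a"], temperature)
    r_low = arrhenius_rate(LOW["r0"], LOW["e_a"], temperature)
    dominant = "high-Ea" if r_high > r_low else "low-Ea"
    print(f"T = {temperature:5.1f} K: dominant mechanism is {dominant}")
```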

A more empirical argument is that given in [23]. The argument is based on the lifetimes 
associated with an array of devices that operate at different temperatures. They used semi- 
conductor devices (operating at 300K), tubes and miniature incandescent lamps (operating 
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at 1000K to 1800K), and incandescent lamps (operating at 2800K to 3300K) and found that
the best fit line had a slope that drops as $T^6$. Extrapolating to 77K gives a 3360 fold
lifetime improvement compared to that at room temperature. Note that this improvement will
only be reached after some effort is directed towards identifying and rectifying the failure
mechanisms that are dominant at 77K and not just by cooling devices optimized for room
temperature operation. The two reliability drawbacks of operating at low temperature, 
or temperature much lower than manufacturing temperature, are the mechanical stresses 
associated with temperature cycling of the components during power cycling and the dif- 
ficulty of service. We can reduce temperature cycling by maintaining the machine at low 
temperature even during power down. In addition, experiments involving thermal stressing
of CMOS components show that components manufactured primarily for a room temperature
environment continued to operate properly after repeated and accelerated thermal
cycling [18] [28]. Hence we feel that components specifically designed for operation at 77K
and storage at room temperature would have appreciable thermal cycling tolerance. As for 
service, fault tolerant machine architectures exhibiting graceful degradation, coupled with 
the reduced component failure rate at lower temperature should considerably reduce the 
need for cumbersome service. 
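
The empirical extrapolation quoted earlier in this paragraph can be reproduced in a few
lines. The sketch assumes that lifetime scales as the inverse sixth power of absolute
temperature and that room temperature is taken as 298K. The function name is illustrative.

```python
def lifetime_improvement(t_reference, t_operating, exponent=6.0):
    """Lifetime ratio if lifetime scales as T**(-exponent)."""
    return (t_reference / t_operating) ** exponent

factor = lifetime_improvement(298.0, 77.0)
print(f"lifetime improvement at 77 K: {factor:.0f}x")
```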

In addition to the above, we believe that CRL circuits operating at low temperatures 
would have higher reliability than conventional CMOS. During conventional CMOS switch- 
ing, the circuit temperature can rise above the average temperature for a short time. Since 
the failure rate is related to a high power of temperature, the overall failure rate will be 
higher than that predicted according to the average operating temperature. As CRL cir- 
cuits disallow sudden transients, the failure rate of CRL devices will be lower than that of 
conventional CMOS operating at the same temperature. In addition the peak currents in
CRL are kept to a minimum, thus further reducing other failure effects such as metal migration,
cross talk, etc.

The fourth advantage of operating CMOS devices at 77K is the reduced leakage current 
of PN junctions. The reverse bias leakage current of a PN junction is exponentially related 
to temperature. Since CMOS devices use PN junctions for isolation, a reduction in leakage 
currents results in a reduction in power consumption due to leakage which is the major 
contributor to quiescent power consumption. Another avenue that lower leakage can lead to 
lower power consumption is in DRAM's. Currently, a DRAM module that is not used must 
continually be refreshed to retain the data. Typically, a DRAM cell not being refreshed loses 
its contents after 2ms. Operated at low temperature, DRAM's were shown to maintain data 
integrity without refreshing for up to 169 hours after which the experiment was stopped [32]. 
Similar results for other CMOS circuits were observed in [28] where the authors reported a
quiescent current of about $10^{-10}$ Amps which was the limit of their test board isolation.
This is even more important for CRL circuits. In CRL it is possible for a node to be 
charged and left unconnected for a number of clock cycles. Eventually, this node should 
be "refreshed" if it is to keep the desired voltage level. Like DRAM, the refresh is both 
dissipative and takes time away from computing. With lower leakage, both CRL and 
DRAM's achieve lower power consumption and in some cases higher utilization. 

The fifth is the advantage of lower operating voltage. To eliminate the probability of 
falsely triggering a circuit by thermal voltage, the threshold voltage of devices is chosen 
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to be larger than a safe multiple of kT/q. We therefore expect Vt and hence Vdd to scale
down with temperature as the thermal voltage. Operating at 77K, we can get by with a Vt
that is 298/77 = 3.87 times lower than that required at room temperature, consuming
15 times less power due to the similarly scaled Vdd. Please note that energy saving due to CRL
is on top of the savings gained here and provides an avenue for further reducing the energy 
per operation after the saving from lower Vdd has been pushed to the limit. 
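
The two figures in this paragraph follow from simple scaling, sketched below. The
assumption that switching power is proportional to the square of Vdd at fixed capacitance
and frequency is made here for the illustration; the variable names are not from the text.

```python
T_ROOM = 298.0   # room temperature, kelvin
T_LNT = 77.0     # liquid nitrogen temperature, kelvin

# Thermal voltage kT/q is proportional to T, so Vt and Vdd are assumed
# to scale down by the temperature ratio.
voltage_scale = T_ROOM / T_LNT

# Assumption: switching power is proportional to Vdd**2 at fixed C and f.
power_scale = voltage_scale ** 2

print(f"voltage reduction: {voltage_scale:.2f}x")
print(f"power reduction:   {power_scale:.1f}x")
```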

Enhancement MOSFET parameters 

The most important parameter improvement of a MOSFET at 77K is the improved 
carrier mobility. A number of mechanisms limit the low-field carrier mobility in semicon- 
ductors. The first is lattice scattering caused by lattice vibration and is the dominant 
mechanism at room temperature. Lattice vibration scattering decreases as the temperature
drops below room temperature. The second is scattering caused by ionized impurities, which
becomes important at very low temperatures, around 4K, due to slower carrier thermal velocity.
Since these effects track temperature differently, there is a point at which the effective
mobility is maximum. Interestingly enough, this point is very close to 77K [34] [6] [14] [22]. At
low fields (low Vds), experimenters found that mobility and hence conductance was 4 times
higher at 77K than at room temperature [15]. At high fields, mobility was only 1.7 times
that of room temperature. This is because at high electric fields, with Vds on the order of a few
volts, the longer mean free path of the electrons at 77K coupled with high electric fields
results in electrons attaining speeds that are higher than their thermal speeds (hot electrons).
At this point scattering through optical phonons becomes very efficient at channeling
energy from these fast electrons to the lattice and hence the electron drift velocity reaches
a maximum saturation velocity independent of the field applied. In a conventional CMOS
circuit both low-field and high-field situations occur, leading to an increase in mobility at
77K that is between 1.7 and 4, such as 2.4 [28]. Since the core premise of CRL circuits is
to avoid switching devices while there is a potential across them, CRL operation is strictly
in the low-field region. This means that CRL circuits will always observe the full 4 times
increase in mobility at 77K as opposed to the effective 2.4 factor for conventional CMOS
at 77K. Mobility is important in that it is directly related to transconductance. A higher
transconductance leads to a lower RC time constant for the device and to faster operation.
We therefore expect CRL devices to be 4 times faster at 77K. Another speed up mechanism
is the reduction of the junction capacitance of the source and the drain with lower temper- 
ature. At 77K, the number of ionized impurities decreases due to the onset of "freeze out" 
and this lower concentration widens the space-charge region thus decreasing the effective 
capacitance of the reverse biased isolation PN junctions of the source and drain [22]. Note 
that carrier freeze out does not affect enhancement mode MOSFET's since the carriers in 
the channel do not come from the thermally ionized impurities of the channel but are due 
to band bending by the gate voltage. 

The second parameter improvement is the steeper sub threshold slope of the Ids versus 
Vgs curve [14]. Previously, we stated that we can get by with a lower Vt at 77K. Our
argument was purely based on lower thermal voltage. However, low threshold voltage 
would result in higher sub threshold conduction resulting in higher power consumption due 
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to sub threshold leakage. Fortunately, the steeper slope, 4 times steeper, of the Ids versus 
Vgs curve that results at 77K reduces sub threshold conduction enough to allow for lower
Vt and Vdd with the power saved by the lower voltages not lost because of sub threshold 
leakage. Please remember that energy consumption in CRL circuits due to irreversibility 
is related to Vt, not Vdd, and would drop considerably as Vt becomes small at 77K.

The third improvement is the fact that unlike conventional CMOS, the power delay 
product of CRL drops with temperature through improved RC products of the devices at 
77K. 

Enhancement MOSFET's are more susceptible to hot electron injection into the gate oxide,
altering Vt [20]. However, CRL guarantees low-field conditions in the channel because of
low Vds during switching, thus drastically reducing the production of hot electrons in the
channel. 

Another MOSFET parameter improvement is decreased susceptibility to latch-up [10]. 
This is due to poorer bipolar performance and reduced bulk resistance at 77K. In addition, 
we feel that latch-up is further reduced by CRL techniques due to the absence of transients 
and reduced capacitive coupling effects. 

From the above we see that the majority of material and device parameter improvements
occur at 77K. Furthermore, for all of the work referenced above, normal CMOS operation of
room temperature devices persisted down to 77K. Gaensslen et al. [15] have shown that
equations modeling the behavior of enhancement mode MOSFET's continued to track
experimental results down to 77K with minor modifications.

One adverse parameter change is a variation of the threshold voltage with temperature. 
Experimental data published in [15] [40] show a 0.25 volt increase in Vt at 77K from that
at room temperature. The results are stated to be the same for both long and short channel
devices. This is not a serious problem in that this shift could be accurately predicted and
compensated for when designing circuits that are intended for 77K operation. However,
testing of these compensated circuits at room temperature becomes a little tricky as Vt
becomes very low, resulting in false triggering and high sub threshold conduction.

Finally, operating at temperatures lower than the temperature of the surrounding
environment requires additional power for refrigeration. The theoretical coefficient of
refrigeration, c, is the ratio of the heat pumped, Q, to the work expended, W, and is equal
to

$$c = \frac{Q}{W} = \frac{T_c}{T_r - T_c}$$

where $T_c$ is the temperature of the cooled device and $T_r$ is room, or surrounding
environment, temperature. For $T_c$ = 77K and $T_r$ = 300K, c = 0.34. This means that for every Joule
our circuit dissipates at 77K, we have to supply a theoretical minimum of 1/c = 2.89 Joules
for refrigeration. The resulting 3.9 times total power multiple (the dissipated Joule plus
the refrigeration work) is well compensated for
by CRL and the improved circuit characteristics at 77K.
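
The refrigeration overhead can be checked numerically. The sketch below evaluates the
ideal (Carnot) coefficient for the two temperatures given above. Adding the dissipated
joule to the refrigeration work to obtain the total multiple is the interpretation assumed
here, and the function and variable names are illustrative.

```python
def carnot_cop(t_cold, t_room):
    """Ideal coefficient of refrigeration c = Q / W = Tc / (Tr - Tc)."""
    return t_cold / (t_room - t_cold)

c = carnot_cop(77.0, 300.0)
work_per_joule = 1.0 / c                  # minimum refrigeration work per joule removed
total_per_joule = 1.0 + work_per_joule    # circuit dissipation plus refrigeration work

print(f"c = {c:.3f}")
print(f"refrigeration work per joule: {work_per_joule:.2f} J")
print(f"total power multiple: {total_per_joule:.2f}")
```

A practical refrigerator falls short of this ideal coefficient, so the real overhead would
be larger than the theoretical minimum computed here.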

The above results are all attributed to experiments on devices that were optimized 
for room temperature operation. Less data is available for devices fabricated with LNT 
operation in mind. We believe that such devices should exhibit better parameter improve- 
ments due to their optimization. We want to distinguish however between optimization 
for LNT and operation limited to LNT. We think that devices limited to LNT had better 
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demonstrate a sizable improvement so as to justify forfeiting room temperature testing 
and characterization. Furthermore, our emphasis above has been on enhancement mode Si 
MOSFET's. We acknowledge the existence of other devices with far superior characteristics
such as MODFET's and Silicon-on-Insulator (SOI) devices. But since our research
is circuits and systems oriented and because the advantages of CRL techniques run orthog- 
onal to improvements due to better devices and operating conditions, we limited our focus 
to Si MOSFET's. 
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B. SCRL-1 Pinout 



Pin #   Pin Name          Pin #   Pin Name   Pin #   Pin Name
1       r4Pass<1>         29      aln<3>     57      f2Pass<1>
2       r4Pass<0>         30      aln<4>     58      f4SlowRail<0>
3       r4FastRail<1>     31      aln<5>     59      f4FastRail<1>
4       r4FastRail<0>     32      aln<6>     60      f4FastRail<0>
5       r4SlowRail<1>     33      bln<0>     61      f4FastRail<1>
6       r4SlowRail<0>
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Figure B.l: Pinout map of the PGA84 package of SCRL-1. 
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